String Variables - Special Cleaning: How do you clean up a variable from one data set to match another from another data set?

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

String Variables - Special Cleaning: How do you clean up a variable from one data set to match another from another data set?

09 Jan 2024, 09:39

Hi everyone,

I have a complex task that I need your help with. Here are two sets of data:

Dataset 1 example:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str77 model str9 periodocomercial int cc str3 gd long(price similar_model)
"S 60 D3 R-DESIGN AUT. (CO2 142)"        "2010-2013" 1984 "D" 29000 51
"S 60 D3 R-DESIGN AUT. 163CV"            "2010-2013" 1984 "D" 29900 51
"S 60 D3 R-DESIGN KINETIC (2013-)"       "2013-2015" 1984 "D" 28100 88
"S 60 D3 R-DESIGN KINETIC AUT. (2013-)"  "2013-2015" 1984 "D" 30100 88
"S 60 D3 R-DESIGN MOMENTUM (2013-)"      "2013-2015" 1984 "D" 29100 88
"S 60 D3 R-DESIGN MOMENTUM 150"          "2015-"     1969 "D" 30800 88
"S 60 D3 R-DESIGN MOMENTUM AUT. (2013-)" "2013-2015" 1984 "D" 31200 88
"S 60 D3 R-DESIGN MOMENTUM AUT. 150"     "2015-"     1969 "D" 32400 88
"S 60 D3 SUMMUM"                         "2010-2013" 1984 "D" 28400 54
"S 60 D3 SUMMUM (2013-)"                 "2013-2015" 1984 "D" 28800 51
"S 60 D3 SUMMUM 150"                     "2015-"     1969 "D" 30500 51
"S 60 D3 SUMMUM 163CV"                   "2010-2013" 1984 "D" 28900 51
"S 60 D3 SUMMUM AUT. (2013-)"            "2013-2015" 1984 "D" 30800 51
"S 60 D3 SUMMUM AUT. (CO2 126)"          "2010-2013" 1984 "D" 31200 51
"S 60 D3 SUMMUM AUT. (CO2 142)"          "2010-2013" 1984 "D" 30500 51
"S 60 D3 SUMMUM AUT. 150"                "2015-"     1969 "D" 32100 51
"S 60 D3 SUMMUM AUT. 163CV"              "2010-2013" 1984 "D" 30500 51
"S 60 D4 KINETIC"                        "2010-2013" 1984 "D" 25800 50
"S 60 D4 KINETIC (2013-)"                "2013-2015" 1984 "D" 26700 51
"S 60 D4 KINETIC 181"                    "2013-2015" 1969 "D" 27400 51
"S 60 D4 KINETIC AUT."                   "2010-2013" 1984 "D" 27700 51
end

Dataset 2 example:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str8 FEC_MATRICULA str31 model long cilindrada str1 COD_PROPULSION
"14122018" "XC40"              1477 "0"
"18102017" "S60"               1969 "0"
"26022015" "XC60"              2400 "1"
"28012016" "V60 TWIN ENGINE"   2400 "1"
"22052017" "V40 CROSS COUNTRY" 1969 "1"
"25072019" "S60"               1969 "0"
"29062016" "V60 CROSS COUNTRY" 1969 "1"
"25022016" "XC70"              1969 "1"
"25012016" "V40"               1969 "1"
"22102018" "V90 CROSS COUNTRY" 1969 "0"
"05062017" "XC60"              1969 "1"
"29122016" "V40"               1969 "1"
"31082018" "S60"               1969 "1"
"30042018" "V60"               1969 "0"
"29122015" "V40"               1969 "1"
"10072015" "XC60"              1969 "1"
"31082017" "V40 CROSS COUNTRY" 1969 "1"
"28062019" "V40"               1969 "0"
"15112016" "S60"               1969 "1"
"30032016" "S60"               1969 "0"
"14082015" "V40"               1969 "1"
end

I tried the following code for both datasets using SSC command -strgroup- :

For the first dataset shown above:

Code:

gen num = 1
collapse (sum) num, by(description model periodocomercial cc gd pkw cvf cv price)
strgroup model, generate(similar_model) threshold(0.25) first normalize(shorter) force

Basically, I want to merge the model's string variables which are similar. From the dataset 2, I don't have the full exhaustive model's name as in Dataset 1.
However, we have the year of registration, represented by -FEC_MATRICULA-, and the engine capacity, represented by -cilindrada-. We also have the fuel in -COD_PROPULSION-, where 0 means petrol and 1 diesel. These variables should match those in the 1st data set, even if we don't have all the information:

FEC_MATRICULA must fall within the marketing period of Dataset 1 (-periodocomercial-).
The engine capacity must also be identical (-cilindrada- should be of the same value as -cc- in the Dataset 1).
-COD_PROPULSION- should be equal to 1 when -g/d- is equal to D (for diesel).

I would like to have models with the same name (for example "S60"), the same engine capacity, a commercial period that matches and identical fuel grouped under the same model name.
Then I need to calculate the average price of these models, based on these characteristics please.

Could anyone give me some suggestions to do that, please? I am stuck and I don't know how to proceed!

Thank you!

Michael

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30101

09 Jan 2024, 10:14

Well, you could start out by -rangejoin-ing the two data sets on year of sale, with match required on fuel type and cc's. Then you could apply -matchit- to sort out the fuzzy match between model and model_U. To do this, you will have to make some changes to the data to make them compatible with each other. So, something like this:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear*
input str77 model str9 periodocomercial int cc str3 gd long(price similar_model)
"S 60 D3 R-DESIGN AUT. (CO2 142)"        "2010-2013" 1984 "D" 29000 51
"S 60 D3 R-DESIGN AUT. 163CV"            "2010-2013" 1984 "D" 29900 51
"S 60 D3 R-DESIGN KINETIC (2013-)"       "2013-2015" 1984 "D" 28100 88
"S 60 D3 R-DESIGN KINETIC AUT. (2013-)"  "2013-2015" 1984 "D" 30100 88
"S 60 D3 R-DESIGN MOMENTUM (2013-)"      "2013-2015" 1984 "D" 29100 88
"S 60 D3 R-DESIGN MOMENTUM 150"          "2015-"     1969 "D" 30800 88
"S 60 D3 R-DESIGN MOMENTUM AUT. (2013-)" "2013-2015" 1984 "D" 31200 88
"S 60 D3 R-DESIGN MOMENTUM AUT. 150"     "2015-"     1969 "D" 32400 88
"S 60 D3 SUMMUM"                         "2010-2013" 1984 "D" 28400 54
"S 60 D3 SUMMUM (2013-)"                 "2013-2015" 1984 "D" 28800 51
"S 60 D3 SUMMUM 150"                     "2015-"     1969 "D" 30500 51
"S 60 D3 SUMMUM 163CV"                   "2010-2013" 1984 "D" 28900 51
"S 60 D3 SUMMUM AUT. (2013-)"            "2013-2015" 1984 "D" 30800 51
"S 60 D3 SUMMUM AUT. (CO2 126)"          "2010-2013" 1984 "D" 31200 51
"S 60 D3 SUMMUM AUT. (CO2 142)"          "2010-2013" 1984 "D" 30500 51
"S 60 D3 SUMMUM AUT. 150"                "2015-"     1969 "D" 32100 51
"S 60 D3 SUMMUM AUT. 163CV"              "2010-2013" 1984 "D" 30500 51
"S 60 D4 KINETIC"                        "2010-2013" 1984 "D" 25800 50
"S 60 D4 KINETIC (2013-)"                "2013-2015" 1984 "D" 26700 51
"S 60 D4 KINETIC 181"                    "2013-2015" 1969 "D" 27400 51
"S 60 D4 KINETIC AUT."                   "2010-2013" 1984 "D" 27700 51
end

frame create ds2
frame ds2 {
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str8 FEC_MATRICULA str31 model long cilindrada str1 COD_PROPULSION
    "14122018" "XC40"              1477 "0"
    "18102017" "S60"               1969 "0"
    "26022015" "XC60"              2400 "1"
    "28012016" "V60 TWIN ENGINE"   2400 "1"
    "22052017" "V40 CROSS COUNTRY" 1969 "1"
    "25072019" "S60"               1969 "0"
    "29062016" "V60 CROSS COUNTRY" 1969 "1"
    "25022016" "XC70"              1969 "1"
    "25012016" "V40"               1969 "1"
    "22102018" "V90 CROSS COUNTRY" 1969 "0"
    "05062017" "XC60"              1969 "1"
    "29122016" "V40"               1969 "1"
    "31082018" "S60"               1969 "1"
    "30042018" "V60"               1969 "0"
    "29122015" "V40"               1969 "1"
    "10072015" "XC60"              1969 "1"
    "31082017" "V40 CROSS COUNTRY" 1969 "1"
    "28062019" "V40"               1969 "0"
    "15112016" "S60"               1969 "1"
    "30032016" "S60"               1969 "0"
    "14082015" "V40"               1969 "1"
    end
}

split period, gen(year) parse("-") destring
gen byte fuel_type = (gd == "D") if !missing(gd)
rename cc cilindrada

frame ds2 {
    gen year = real(substr(FEC_MATRICULA, -4, 4))
    destring COD_PROPULSION, gen(fuel_type)
    tempfile dataset2_clean
    save `dataset2_clean'
}
rangejoin year year1 year2 using `dataset2_clean', by(fuel_type cilindrada)

From here you can try applying -matchit- to variables model and model_U. That said, looking at the example data, it doesn't look to me like the model information in the two data sets is very matchable. When I look at every model from the second data set, none of them look like fuzzy matches to any of the models in the first. Unless these examples are highly unrepresentative, I think you will not get very good results. I can't think of anything to suggest that would help with this aspect of it.

-rangejoin- is written by Robert Picard, and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Comment

Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#3

10 Jan 2024, 01:00

Hi Clyde Schechter,

Thank you so much for your help and feedback!
Let's see if I can get some good matches (but as you said, it might not be optimal, let's see...).

Thanks again for everything!

Michael
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#4

10 Jan 2024, 07:22

Following some advices by daniel klein, I update this thread and give more details:

I'm working on a big project, with the aim of merging a database containing car registrations issued in Spain between 2015 and 2019. This database is of really poor quality. The strings are very dirty.
On the other hand, we have a database with the prices of new vehicles on sale covering several commercial periods. This is an exhaustive dataset.

The problem is that our key variable (model names) is not written in the same way at all across datasets unfortunately.
As a result, I have a lot of work ahead of me to harmonise the labels.

The Principal Investigator I'm working with isn't very keen on -matchit- or -reclink2-, because he thinks it's difficult to know exactly what these commands do, and also for problems stated in #2.

As a result, his suggestion is to use -strgroup- from SSC to manually "clean up" the data in the two databases of interest presented in #1.
The ultimate aim is to have identical model values.

However, I don't know how to start with -strgroup-, as it's tedious. Here are a few key elements that will give us a clearer idea of the right model:
the commercial period: it's important that the commercial period and the registration date match, as we're talking about new vehicles.

the cubic centimetres give us an idea of the model,

and type of fuel (petrol, diesel, electric, etc.)

Does anyone has a suggestion apart from -matchit- and -reclink2- ?

Thank you in advance!
Best,

Michael
Comment

Announcement

String Variables - Special Cleaning: How do you clean up a variable from one data set to match another from another data set?

Comment

Comment

Comment