Hi everyone,
I have a complex task that I need your help with. Here are two sets of data:
Dataset 1 example:
Dataset 2 example:
I tried the following code for both datasets using SSC command -strgroup- :
For the first dataset shown above:
Basically, I want to merge the model's string variables which are similar. From the dataset 2, I don't have the full exhaustive model's name as in Dataset 1.
However, we have the year of registration, represented by -FEC_MATRICULA-, and the engine capacity, represented by -cilindrada-. We also have the fuel in -COD_PROPULSION-, where 0 means petrol and 1 diesel. These variables should match those in the 1st data set, even if we don't have all the information:
Then I need to calculate the average price of these models, based on these characteristics please.
Could anyone give me some suggestions to do that, please? I am stuck and I don't know how to proceed!
Thank you!
Michael
I have a complex task that I need your help with. Here are two sets of data:
Dataset 1 example:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str77 model str9 periodocomercial int cc str3 gd long(price similar_model) "S 60 D3 R-DESIGN AUT. (CO2 142)" "2010-2013" 1984 "D" 29000 51 "S 60 D3 R-DESIGN AUT. 163CV" "2010-2013" 1984 "D" 29900 51 "S 60 D3 R-DESIGN KINETIC (2013-)" "2013-2015" 1984 "D" 28100 88 "S 60 D3 R-DESIGN KINETIC AUT. (2013-)" "2013-2015" 1984 "D" 30100 88 "S 60 D3 R-DESIGN MOMENTUM (2013-)" "2013-2015" 1984 "D" 29100 88 "S 60 D3 R-DESIGN MOMENTUM 150" "2015-" 1969 "D" 30800 88 "S 60 D3 R-DESIGN MOMENTUM AUT. (2013-)" "2013-2015" 1984 "D" 31200 88 "S 60 D3 R-DESIGN MOMENTUM AUT. 150" "2015-" 1969 "D" 32400 88 "S 60 D3 SUMMUM" "2010-2013" 1984 "D" 28400 54 "S 60 D3 SUMMUM (2013-)" "2013-2015" 1984 "D" 28800 51 "S 60 D3 SUMMUM 150" "2015-" 1969 "D" 30500 51 "S 60 D3 SUMMUM 163CV" "2010-2013" 1984 "D" 28900 51 "S 60 D3 SUMMUM AUT. (2013-)" "2013-2015" 1984 "D" 30800 51 "S 60 D3 SUMMUM AUT. (CO2 126)" "2010-2013" 1984 "D" 31200 51 "S 60 D3 SUMMUM AUT. (CO2 142)" "2010-2013" 1984 "D" 30500 51 "S 60 D3 SUMMUM AUT. 150" "2015-" 1969 "D" 32100 51 "S 60 D3 SUMMUM AUT. 163CV" "2010-2013" 1984 "D" 30500 51 "S 60 D4 KINETIC" "2010-2013" 1984 "D" 25800 50 "S 60 D4 KINETIC (2013-)" "2013-2015" 1984 "D" 26700 51 "S 60 D4 KINETIC 181" "2013-2015" 1969 "D" 27400 51 "S 60 D4 KINETIC AUT." "2010-2013" 1984 "D" 27700 51 end
Dataset 2 example:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str8 FEC_MATRICULA str31 model long cilindrada str1 COD_PROPULSION "14122018" "XC40" 1477 "0" "18102017" "S60" 1969 "0" "26022015" "XC60" 2400 "1" "28012016" "V60 TWIN ENGINE" 2400 "1" "22052017" "V40 CROSS COUNTRY" 1969 "1" "25072019" "S60" 1969 "0" "29062016" "V60 CROSS COUNTRY" 1969 "1" "25022016" "XC70" 1969 "1" "25012016" "V40" 1969 "1" "22102018" "V90 CROSS COUNTRY" 1969 "0" "05062017" "XC60" 1969 "1" "29122016" "V40" 1969 "1" "31082018" "S60" 1969 "1" "30042018" "V60" 1969 "0" "29122015" "V40" 1969 "1" "10072015" "XC60" 1969 "1" "31082017" "V40 CROSS COUNTRY" 1969 "1" "28062019" "V40" 1969 "0" "15112016" "S60" 1969 "1" "30032016" "S60" 1969 "0" "14082015" "V40" 1969 "1" end
I tried the following code for both datasets using SSC command -strgroup- :
For the first dataset shown above:
Code:
gen num = 1 collapse (sum) num, by(description model periodocomercial cc gd pkw cvf cv price) strgroup model, generate(similar_model) threshold(0.25) first normalize(shorter) force
Basically, I want to merge the model's string variables which are similar. From the dataset 2, I don't have the full exhaustive model's name as in Dataset 1.
However, we have the year of registration, represented by -FEC_MATRICULA-, and the engine capacity, represented by -cilindrada-. We also have the fuel in -COD_PROPULSION-, where 0 means petrol and 1 diesel. These variables should match those in the 1st data set, even if we don't have all the information:
- FEC_MATRICULA must fall within the marketing period of Dataset 1 (-periodocomercial-).
- The engine capacity must also be identical (-cilindrada- should be of the same value as -cc- in the Dataset 1).
- -COD_PROPULSION- should be equal to 1 when -g/d- is equal to D (for diesel).
Then I need to calculate the average price of these models, based on these characteristics please.
Could anyone give me some suggestions to do that, please? I am stuck and I don't know how to proceed!
Thank you!
Michael
Comment