  • Fuzzy Group with -strgroup- : How to "Match" Similar Strings under Certain Conditions?

    Dear Statalist Community,


    I have a database of cars sold new in Spain for several years up to 2019. I have several models that are similar, but not completely identical.
    I know that brand names should be avoided on some forums (Stack Overflow, for example), so I have removed the car brand from my -dataex- example below for consistency with other forums.

    I want to build "fuzzy" groups of the near-identical models using Julian Reif's -strgroup-. This command is available from:

    Code:
     net install strgroup, from("https://raw.githubusercontent.com/reifjulian/strgroup/master") replace
    or from SSC:

    Code:
     ssc install strgroup, replace
    To give more context, here is a -dataex- example:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str170 modelo str9 periodo str4 cc str6 cilind str3 gd str9 pkw str17 cvf str5 co2 str4 cv str6 valor
    "V 60 T6 Momentum Aut. 306"                            "-2013"     "1969" "4" "G" "225" "13.2"  "157" "306" "36800"
    "V 60 T6 Momentum AWD Aut."                            "-2013"     "2953" "6" "G" "224" "19.79" "237" "304" "41200"
    "V 60 T6 R-Design Momentum Aut. 306"                   "-2013"     "1969" "4" "G" "225" "13.2"  "157" "306" "38700"
    "V 60 T6 R-Design Momentum AWD Aut."                   "-2013"     "2953" "6" "G" "224" "19.79" "237" "304" "43200"
    "V 60 T6 Summum Aut. 306"                              "-2013"     "1969" "4" "G" "225" "13.2"  "157" "306" "39400"
    "V 60 T6 Summum AWD Aut."                              "-2013"     "2953" "6" "G" "224" "19.79" "237" "304" "42900"
    "V 70  T6 Momentum AWD                               " "2007-2009" "2953" "6" "G" "209" "19.79" "270" "284" "39400"
    "V 70  T6 R-Design AWD"                                "2007-2009" "2953" "4" "G" "209" "16.83" "270" "284" "43700"
    "V 70  T6 Summum AWD                               "   "2007-2009" "2953" "6" "G" "209" "19.79" "270" "284" "42000"
    "V 70 2.0 Aut."                                        "1997-2000" "1984" "5" "G" "93"  "14.49" ""    "127" "17800"
    "V 70 2.0"                                             "1997-2000" "1984" "5" "G" "93"  "14.49" ""    "127" "16900"
    "V 70 2.0D Kinetic"                                    "2007-2013" "1997" "4" "D" "100" "13.31" "157" "136" "28000"
    "V 70 2.0D Momentum"                                   "2007-2013" "1997" "4" "D" "100" "13.31" "157" "136" "30100"
    "V 70 2.0D Summum"                                     "2007-2013" "1997" "4" "D" "100" "13.31" "157" "136" "32900"
    "V 70 2.0F Kinetic"                                    "2007-2009" "1999" "4" "M" "107" "13.32" "206" "146" "27300"
    "V 70 2.0F Momentum"                                   "2007-2009" "1999" "4" "M" "107" "13.32" "206" "146" "29400"
    "V 70 2.0F Summum"                                     "2007-2009" "1999" "4" "M" "107" "13.32" "206" "146" "32200"
    "V 70 2.3 T5 Optima Aut"                               "2000-2004" "2319" "5" "G" "184" "15.92" ""    "250" "31300"
    "V 70 2.4 140 Aut"                                     "2000-2004" "2435" "5" "G" "103" "16.39" ""    "140" "21800"
    "V 70 2.4 140 Optima Aut"                              "2000-2004" "2435" "5" "G" "103" "16.39" ""    "140" "23800"
    "V 70 2.4 140 Optima"                                  "2000-2004" "2435" "5" "G" "103" "16.39" ""    "140" "22800"
    end



    For example, I'd like to group the "V 60 T6" models, the "V 70 2.4" models, the "V 70 2.0" models, and so on throughout my data set, if possible, subject to two conditions. Models should be grouped together only if they share:
    1. the same cubic capacity (variable -cc-), and
    2. the same commercial period (variable -periodo-).
    Note: The "-2013" value above is malformed. It should read "2013-" instead and needs to be cleaned.
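
    A minimal sketch of that cleaning step might look like the following. It is run on a three-row excerpt of the data above, and it assumes that every malformed period is exactly a hyphen followed by four digits and that -cc- and -valor- contain only digits. Collapsing the repeated blanks in -modelo- is an extra step not requested above, added because stray spaces would otherwise inflate string distances.

```stata
* Three-row excerpt of the -dataex- example, kept as strings as in the original
clear
input str40 modelo str9 periodo str4 cc str6 valor
"V 60 T6 Momentum Aut. 306" "-2013"     "1969" "36800"
"V 70  T6 Momentum AWD   "  "2007-2009" "2953" "39400"
"V 70 2.0D Kinetic"         "2007-2013" "1997" "28000"
end

* Assumption: a malformed period is a hyphen followed by exactly four digits.
* Move the hyphen to the end, so "-2013" becomes "2013-".
replace periodo = substr(periodo, 2, .) + "-" ///
    if regexm(periodo, "^-[0-9][0-9][0-9][0-9]$")

* Remove leading/trailing blanks and collapse repeated internal blanks
replace modelo = strtrim(stritrim(modelo))

* Convert cubic capacity and price from string to numeric
destring cc valor, replace
```

    Periods that already read "2007-2009" do not match the pattern and are left untouched.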

    The final goal is to calculate the average price of the grouped models, where the price of each model is given by -valor-. I then need to merge this dataset with another one from a completely different source whose naming conventions differ, which will merit a separate post from me soon.
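
    One possible way to combine the two conditions with the fuzzy match is sketched below, on a five-row excerpt with -cc- and -valor- already numeric. It runs -strgroup- separately inside each (cc, periodo) cell through its -if- qualifier, so models from different cells can never share a group, and then averages -valor- within each resulting group. The variable names cell, fuzzy, g_tmp and precio_medio are hypothetical, and threshold(0.5) is an arbitrary illustrative value that would need tuning against the full data; -strgroup- must be installed first.

```stata
* Five-row excerpt of the -dataex- example, with cc and valor entered as numeric
clear
input str40 modelo str9 periodo int cc long valor
"V 70 2.0D Kinetic"   "2007-2013" 1997 28000
"V 70 2.0D Momentum"  "2007-2013" 1997 30100
"V 70 2.0D Summum"    "2007-2013" 1997 32900
"V 70 2.4 140 Aut"    "2000-2004" 2435 21800
"V 70 2.4 140 Optima" "2000-2004" 2435 22800
end

* Exact-match cells: same cubic capacity and same commercial period
egen long cell = group(cc periodo)

* Fuzzy groups of model names, built one cell at a time
gen long fuzzy = .
levelsof cell, local(cells)
foreach c of local cells {
    * threshold(0.5) is an illustrative assumption, not a recommended value
    strgroup modelo if cell == `c', generate(g_tmp) threshold(0.5)
    replace fuzzy = g_tmp if cell == `c'
    drop g_tmp
}

* Average price within each (cell, fuzzy group) combination
egen double precio_medio = mean(valor), by(cell fuzzy)

list modelo cc periodo fuzzy precio_medio, sepby(cell)
```

    Looping over cells rather than matching on a concatenated string keeps the two conditions exact, with only the model name compared approximately. Which names end up together depends entirely on the chosen threshold, so the listing is worth inspecting before the averages are used.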


    Thank you in advance for your help.

    Best regards,

    Michael
    Last edited by Michael Duarte Goncalves; 04 Jan 2024, 03:53.