Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data management with Strings

    Hello!

    Stata version: 16.1
    Issue: Don't know how to approach the cleaning of this dataset as all the relevant information are stored in strings

    Hello everyone.
    I'm working with a dataset on gasoline prices I retrieved in the french government website https://www.prix-carburants.gouv.fr/rubrique/opendata/.
    The format of the downloaded data is xml.

    After the following comands:
    Code:
    ssc install leftalign // Run only once
    ********************************************************************************
    **# Load data
    ********************************************************************************
    *Load the data
    clear
    infix str line 1-700 using ..\Raw\PrixCarburants_annuel_2023.xml               
    compress
    leftalign

    The dataset contains daily, retail, gasoline prices charged by the Universe of the French Gas-stations.

    I am left with a dataset that looks like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str104 line
    `"<pdv id="1000001" latitude="4620100" longitude="519800" cp="01000" pop="R">"'
    "<adresse>596 AVENUE DE TREVOUX</adresse>"                                     
    "<ville>SAINT-DENIS-L&#xE8;S-BOURG</ville>"                                    
    `"<horaires automate-24-24="">"'                                               
    `"<jour id="1" nom="Lundi" ferme="1"/>"'                                       
    `"<jour id="2" nom="Mardi" ferme="1"/>"'                                       
    `"<jour id="3" nom="Mercredi" ferme="1"/>"'                                    
    `"<jour id="4" nom="Jeudi" ferme="1"/>"'                                       
    `"<jour id="5" nom="Vendredi" ferme="1"/>"'                                    
    `"<jour id="6" nom="Samedi" ferme="1"/>"'                                      
    `"<jour id="7" nom="Dimanche" ferme="1"/>"'                                    
    "</horaires>"                                                                  
    "<services>"                                                                   
    "<service>Station de gonflage</service>"                                       
    "<service>Vente de gaz domestique (Butane, Propane)</service>"                 
    "<service>DAB (Distributeur automatique de billets)</service>"                 
    "</services>"                                                                  
    `"<prix nom="Gazole" id="1" maj="2023-01-02T07:53:26" valeur="1.867"/>"'       
    `"<prix nom="Gazole" id="1" maj="2023-01-05T09:33:37" valeur="1.877"/>"'       
    `"<prix nom="Gazole" id="1" maj="2023-01-09T14:51:49" valeur="1.875"/>"'       
    `"<prix nom="Gazole" id="1" maj="2023-01-11T09:23:54" valeur="1.859"/>"'       
    `"<prix nom="Gazole" id="1" maj="2023-01-13T09:07:40" valeur="1.862"/>"'       
    `"<prix nom="Gazole" id="1" maj="2023-01-16T09:34:16" valeur="1.885"/>"'       
    `"<pdv id="1000002" latitude="4621842" longitude="522767" cp="01000" pop="R">"'
    "<adresse>16 Avenue de Marboz</adresse>"                                       
    "<ville>BOURG-EN-BRESSE</ville>"                                               
    `"<horaires automate-24-24="">"'                                               
    `"<jour id="1" nom="Lundi" ferme=""/>"'                                        
    `"<jour id="2" nom="Mardi" ferme=""/>"'                                        
    `"<jour id="3" nom="Mercredi" ferme=""/>"'                                     
    `"<jour id="4" nom="Jeudi" ferme=""/>"'                                        
    `"<jour id="5" nom="Vendredi" ferme=""/>"'                                     
    `"<jour id="6" nom="Samedi" ferme=""/>"'                                       
    `"<jour id="7" nom="Dimanche" ferme=""/>"'                                     
    "</horaires>"                                                                  
    "<services>"                                                                   
    "<service>Vente de gaz domestique (Butane, Propane)</service>"                 
    "<service>DAB (Distributeur automatique de billets)</service>"                 
    "</services>"                                                                  
    `"<prix nom="Gazole" id="1" maj="2023-01-02T06:05:21" valeur="1.859"/>"'       
    `"<prix nom="Gazole" id="1" maj="2023-01-02T11:37:53" valeur="1.799"/>"'       
    `"<prix nom="Gazole" id="1" maj="2023-01-05T11:24:03" valeur="1.859"/>"'       
    end


    My objective is to end up with a panel dataset with:
    gas-station- id as the individual identifier
    the latest daily hour as the time identifier (e.g. if I have maj="2023-01-02T06:05:21" and maj="2023-01-02T11:37:53", I would pick the second one and call it 2023-01-02)
    all the gas station characteristics stored in the variables (e.g. latitude="4620100" longitude="519800", <adresse>596 AVENUE DE TREVOUX</adresse>, <ville>SAINT-DENIS-L&#xE8;S-BOURG</ville>)

    It is the first time I work with such a messy dataset and any guidance would be helpful.



  • #2
    This is typical of text data extracted from a webpage and the extraction procedure was recently discussed in https://journals.sagepub.com/doi/10....6867X241233678. This should get you started. Looking at

    `"<pdv id="1000001" latitude="4620100" longitude="519800" cp="01000" pop="R">"'
    and

    `"<pdv id="1000002" latitude="4621842" longitude="522767" cp="01000" pop="R">"'
    we can observe a pattern. Therefore:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str104 line
    `"<pdv id="1000001" latitude="4620100" longitude="519800" cp="01000" pop="R">"'
    "<adresse>596 AVENUE DE TREVOUX</adresse>"                                    
    "<ville>SAINT-DENIS-L&#xE8;S-BOURG</ville>"                                    
    `"<horaires automate-24-24="">"'                                              
    `"<jour id="1" nom="Lundi" ferme="1"/>"'                                      
    `"<jour id="2" nom="Mardi" ferme="1"/>"'                                      
    `"<jour id="3" nom="Mercredi" ferme="1"/>"'                                    
    `"<jour id="4" nom="Jeudi" ferme="1"/>"'                                      
    `"<jour id="5" nom="Vendredi" ferme="1"/>"'                                    
    `"<jour id="6" nom="Samedi" ferme="1"/>"'                                      
    `"<jour id="7" nom="Dimanche" ferme="1"/>"'                                    
    "</horaires>"                                                                  
    "<services>"                                                                  
    "<service>Station de gonflage</service>"                                      
    "<service>Vente de gaz domestique (Butane, Propane)</service>"                
    "<service>DAB (Distributeur automatique de billets)</service>"                
    "</services>"                                                                  
    `"<prix nom="Gazole" id="1" maj="2023-01-02T07:53:26" valeur="1.867"/>"'      
    `"<prix nom="Gazole" id="1" maj="2023-01-05T09:33:37" valeur="1.877"/>"'      
    `"<prix nom="Gazole" id="1" maj="2023-01-09T14:51:49" valeur="1.875"/>"'      
    `"<prix nom="Gazole" id="1" maj="2023-01-11T09:23:54" valeur="1.859"/>"'      
    `"<prix nom="Gazole" id="1" maj="2023-01-13T09:07:40" valeur="1.862"/>"'      
    `"<prix nom="Gazole" id="1" maj="2023-01-16T09:34:16" valeur="1.885"/>"'      
    `"<pdv id="1000002" latitude="4621842" longitude="522767" cp="01000" pop="R">"'
    "<adresse>16 Avenue de Marboz</adresse>"                                      
    "<ville>BOURG-EN-BRESSE</ville>"                                              
    `"<horaires automate-24-24="">"'                                              
    `"<jour id="1" nom="Lundi" ferme=""/>"'                                        
    `"<jour id="2" nom="Mardi" ferme=""/>"'                                        
    `"<jour id="3" nom="Mercredi" ferme=""/>"'                                    
    `"<jour id="4" nom="Jeudi" ferme=""/>"'                                        
    `"<jour id="5" nom="Vendredi" ferme=""/>"'                                    
    `"<jour id="6" nom="Samedi" ferme=""/>"'                                      
    `"<jour id="7" nom="Dimanche" ferme=""/>"'                                    
    "</horaires>"                                                                  
    "<services>"                                                                  
    "<service>Vente de gaz domestique (Butane, Propane)</service>"                
    "<service>DAB (Distributeur automatique de billets)</service>"                
    "</services>"                                                                  
    `"<prix nom="Gazole" id="1" maj="2023-01-02T06:05:21" valeur="1.859"/>"'      
    `"<prix nom="Gazole" id="1" maj="2023-01-02T11:37:53" valeur="1.799"/>"'      
    `"<prix nom="Gazole" id="1" maj="2023-01-05T11:24:03" valeur="1.859"/>"'      
    end
    
    gen id= trim(itrim(ustrregexra(line,`"<pdv id="(\d+)"\s+latitude(.*)"', "$1"))) if regexm(line, "pdv id=")
    gen latitude=trim(itrim(ustrregexra(line,`"(.*)latitude="(\d+)"\s+longitude(.*)"', "$2"))) if regexm(line, "pdv id=")
    gen longitude=trim(itrim(ustrregexra(line,`"(.*)longitude="(\d+)"\s+cp(.*)"', "$2"))) if regexm(line, "pdv id=")
    gen cp=trim(itrim(ustrregexra(line,`"(.*)cp="(\d+)"\s+pop(.*)"', "$2"))) if regexm(line, "pdv id=")
    keep if !missing(id)
    keep id latitude longitude cp
    Res.:


    Code:
    . l
    
         +---------------------------------------+
         |      id   latitude   longit~e      cp |
         |---------------------------------------|
      1. | 1000001    4620100     519800   01000 |
      2. | 1000002    4621842     522767   01000 |
         +---------------------------------------+

    Comment

    Working...
    X