  • Concatenating to previous observation as long as there is no change in varlist

    Dear forum,

    I used the R package -pdftools- to turn PDF documentation files into datasets, and I am now using Stata to build a sort of "metadata" database. Here is what my working dataset looks like (after a bit of cleaning):

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(width height x y space) str126 text str27 font_name double font_size long(page obs_number)
    89 13  56 572 0 "VARNAME"     "BAAAAA+Arial-BoldMT" 12 73 22435
    21 11  56 588 1 "This"        "BAAAAA+Arial-BoldMT" 10 73 22436
    17 11  81 588 1 "bit"         "BAAAAA+Arial-BoldMT" 10 73 22437
    54 11 101 588 1 "is"          "BAAAAA+Arial-BoldMT" 10 73 22438
     3 11 158 588 1 "a"           "BAAAAA+Arial-BoldMT" 10 73 22439
    18 11 164 588 1 "variable"    "BAAAAA+Arial-BoldMT" 10 73 22440
    39 11 186 588 0 "description" "BAAAAA+Arial-BoldMT" 10 73 22441
    35 11  56 604 1 "The"         "CAAAAA+ArialMT"      10 73 22442
    11 11  95 604 1 "input"       "CAAAAA+ArialMT"      10 73 22443
    28 11 109 604 1 "variable"    "CAAAAA+ArialMT"      10 73 22444
     5 11 140 604 1 "="           "CAAAAA+ArialMT"      10 73 22445
    42 11 148 604 0 "INPUT_VAR"   "CAAAAA+ArialMT"      10 73 22446
    23 11  56 620 1 "List"        "BAAAAA+Arial-BoldMT" 10 73 22447
    17 11  82 620 1 "of"          "BAAAAA+Arial-BoldMT" 10 73 22448
    28 11 102 620 0 "values"      "BAAAAA+Arial-BoldMT" 10 73 22449
    22 11 130 633 1 "Missing"     "CAAAAA+ArialMT"      10 73 22450
    21 11 155 633 1 "data"        "CAAAAA+ArialMT"      10 73 22451
    11 11 180 633 1 "or"          "CAAAAA+ArialMT"      10 73 22452
    27 11 194 633 1 "doesn't"     "CAAAAA+ArialMT"      10 73 22453
    49 11 224 633 0 "know"        "CAAAAA+ArialMT"      10 73 22454
     5 10  59 649 0 "1"           "CAAAAA+ArialMT"       9 73 22455
    15 11 130 648 0 "Yes"         "CAAAAA+ArialMT"      10 73 22456
     5 10  59 664 0 "2"           "CAAAAA+ArialMT"       9 73 22457
    18 11 130 664 0 "No"          "CAAAAA+ArialMT"      10 73 22458
    end
    As you can see, each observation is one word (the output of the R function), and I have information such as the font characteristics, the coordinates of the word, and the page number. I now want to rebuild the sentences as they appeared in the PDF, and to do this I need to look at the font name and font size. I need code that does this:

    "While there is no change in BOTH variables font_name and font_size,
    append the content of text in observation i to the content of text in observation i-1 (separated by a space) and drop observation i."

    The final dataset should look like this (the other variables should be left unchanged):

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str34 text
    "VARNAME"                           
    "This bit is a variable description"
    "The input variable = INPUT_VAR"    
    "List of values"                    
    "Missing data or doesn't know"      
    "1"                                 
    "Yes"                               
    "2"                                 
    "No"                                
    end
    I'd appreciate any help!
    Valentine

  • #2
    The following code will do most of what you are after. I don't know exactly which values you want to keep for width, height, x, y, page, and obs_number, but this keeps only the values from the observations where space == 0.
    Code:
    replace text = text[_n - 1] + " " + text if (font_name == font_name[_n - 1]) & (font_size == font_size[_n - 1])
    keep if space == 0
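    The one-line -replace- above builds whole sentences because Stata evaluates -replace- one observation at a time, in order, so text[_n - 1] already holds the words accumulated so far. A minimal sketch on toy data (the font names "Bold" and "Plain" are made up for illustration, not taken from the dataset):

```stata
* Toy data: two words in one font, then one word in another (hypothetical values)
clear
input str12 text str6 font_name double font_size
"This" "Bold"  10
"bit"  "Bold"  10
"The"  "Plain" 10
end

* Observation 1 is never changed: font_name[0] is "", which differs from "Bold"
replace text = text[_n - 1] + " " + text ///
    if (font_name == font_name[_n - 1]) & (font_size == font_size[_n - 1])

* The last observation of each run of identical fonts now holds the full string
list text font_name font_size
```

    After this step, only the last observation of each run needs to be kept.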



    • #3
      Thank you for your fast reply, George! However, I forgot to mention that space is not a reliable source of information for my purpose, because sentences that span more than one line have several observations with space == 0. For instance, the last word of a line is recorded as having no space, even if it doesn't end a sentence.

      I'll find a way to work around this, but in the meantime thank you so much!

      Edit: replacing -keep if space == 0- with
      Code:
      keep if font_name != font_name[_n+1] | font_size != font_size[_n+1]
      does the job!
      Last edited by Valentine Laurent; 03 Jan 2024, 10:51.
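      For readers who prefer to make the grouping explicit, here is a possible variant of the same idea, offered as a sketch rather than a definitive solution. It assumes the data are in reading order when sorted by obs_number, and it introduces a helper variable, run_id (a hypothetical name), that increases by one each time font_name or font_size changes. The input is a shortened subset of the example data with only the variables needed.

```stata
* Shortened subset of the example data (other variables omitted)
clear
input str12 text str27 font_name double font_size long obs_number
"VARNAME" "BAAAAA+Arial-BoldMT" 12 22435
"This"    "BAAAAA+Arial-BoldMT" 10 22436
"bit"     "BAAAAA+Arial-BoldMT" 10 22437
"The"     "CAAAAA+ArialMT"      10 22442
"input"   "CAAAAA+ArialMT"      10 22443
end

* Assumption: sorting by obs_number restores the reading order of the PDF
sort obs_number

* run_id (hypothetical helper) counts font changes; sum() is a running sum
gen long run_id = sum(font_name != font_name[_n - 1] | font_size != font_size[_n - 1])

* Within each run, append each word to the text accumulated so far
bysort run_id (obs_number): replace text = text[_n - 1] + " " + text if _n > 1

* Keep the last observation of each run, which holds the complete string
by run_id: keep if _n == _N
list text font_name font_size
```

      With an explicit run_id, further break conditions (for example a change in page) could be added to the -gen- line without touching the rest.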
