Concatenating to previous observation as long as there is no change in varlist

Valentine Laurent

Join Date: Oct 2023
Posts: 7


03 Jan 2024, 09:32

Dear forum,

I used the R package -pdftools- to turn PDF documentation files into datasets, and I am now using Stata to build a kind of "metadata" database. Here is what my working dataset looks like (after a bit of cleaning):

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(width height x y space) str126 text str27 font_name double font_size long(page obs_number)
89 13  56 572 0 "VARNAME"     "BAAAAA+Arial-BoldMT" 12 73 22435
21 11  56 588 1 "This"        "BAAAAA+Arial-BoldMT" 10 73 22436
17 11  81 588 1 "bit"         "BAAAAA+Arial-BoldMT" 10 73 22437
54 11 101 588 1 "is"          "BAAAAA+Arial-BoldMT" 10 73 22438
 3 11 158 588 1 "a"           "BAAAAA+Arial-BoldMT" 10 73 22439
18 11 164 588 1 "variable"    "BAAAAA+Arial-BoldMT" 10 73 22440
39 11 186 588 0 "description" "BAAAAA+Arial-BoldMT" 10 73 22441
35 11  56 604 1 "The"         "CAAAAA+ArialMT"      10 73 22442
11 11  95 604 1 "input"       "CAAAAA+ArialMT"      10 73 22443
28 11 109 604 1 "variable"    "CAAAAA+ArialMT"      10 73 22444
 5 11 140 604 1 "="           "CAAAAA+ArialMT"      10 73 22445
42 11 148 604 0 "INPUT_VAR"   "CAAAAA+ArialMT"      10 73 22446
23 11  56 620 1 "List"        "BAAAAA+Arial-BoldMT" 10 73 22447
17 11  82 620 1 "of"          "BAAAAA+Arial-BoldMT" 10 73 22448
28 11 102 620 0 "values"      "BAAAAA+Arial-BoldMT" 10 73 22449
22 11 130 633 1 "Missing"     "CAAAAA+ArialMT"      10 73 22450
21 11 155 633 1 "data"        "CAAAAA+ArialMT"      10 73 22451
11 11 180 633 1 "or"          "CAAAAA+ArialMT"      10 73 22452
27 11 194 633 1 "doesn't"     "CAAAAA+ArialMT"      10 73 22453
49 11 224 633 0 "know"        "CAAAAA+ArialMT"      10 73 22454
 5 10  59 649 0 "1"           "CAAAAA+ArialMT"       9 73 22455
15 11 130 648 0 "Yes"         "CAAAAA+ArialMT"      10 73 22456
 5 10  59 664 0 "2"           "CAAAAA+ArialMT"       9 73 22457
18 11 130 664 0 "No"          "CAAAAA+ArialMT"      10 73 22458
end

As you can see, each observation is one word (the output of the R function), and I have access to information such as the font characteristics, the coordinates of the word, the page number, etc. What I want to do now is rebuild the sentences as they were in the PDF, and to do this I need to look at the font size and font name. I need code that does this:

"While there is no change in BOTH variables font_name and font_size,
paste the content of text in observation i into the content of text in observation i-1 (with a space) and drop observation i. "

The final dataset should look like this (the rest of the variables should be unchanged):

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str34 text
"VARNAME"                           
"This bit is a variable description"
"The input variable = INPUT_VAR"    
"List of values"                    
"Missing data or doesn't know"      
"1"                                 
"Yes"                               
"2"                                 
"No"                                
end

I'd appreciate any help!
Valentine


George Taylor

Join Date: Dec 2023

Posts: 14
#2

03 Jan 2024, 09:57

The following code will do most of what you are after. I don't know exactly what values you want to keep for width, height, x, y, page, obs_number, but this will keep only the values for the observation where space == 0.

Code:

replace text = text[_n - 1] + " " + text if (font_name == font_name[_n - 1]) & (font_size == font_size[_n - 1])
keep if space == 0
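A note on why a single -replace- is enough here: Stata evaluates -replace- one observation at a time in the current sort order, so by the time it reaches observation i, text[_n - 1] already holds the string accumulated so far. The toy example below is only a sketch to illustrate this; the four rows and the font labels "Bold" and "Regular" are made up and are not part of the original data.

```stata
* Toy data (hypothetical values, for illustration only)
clear
input str10 text str8 font_name double font_size
"This"  "Bold"    10
"bit"   "Bold"    10
"The"   "Regular" 10
"input" "Regular" 10
end

* For observation 1, font_name[0] is "" and font_size[0] is missing,
* so the condition is false and the first word is left untouched.
* -replace- promotes the string type automatically when the result is longer.
replace text = text[_n - 1] + " " + text ///
    if font_name == font_name[_n - 1] & font_size == font_size[_n - 1]

* The last observation of each run of identical fonts now holds the full string
list text font_name font_size
```

After this step, the only remaining task is to decide which observation of each run to keep.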
Valentine Laurent

Join Date: Oct 2023

Posts: 7
#3

03 Jan 2024, 10:07

Thank you for your fast reply, George! However, I forgot to mention that space is not a reliable source of information for my purpose, because sentences that span more than one line will have several observations with space == 0. For instance, the last word of a line is recorded as having no space, even if it doesn't end a sentence.

I'll find a way to work around this, but in the meantime thank you so much!

Edit:

Code:

keep if font_name != font_name[_n+1] | font_size != font_size[_n+1]

does the job!

Last edited by Valentine Laurent; 03 Jan 2024, 10:51.
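For readers who want the whole thing in one place, here is a minimal sketch of an equivalent approach that does not use space at all. It assumes the data are in reading order by obs_number; the variable name block and the toy rows are hypothetical and only there to make the example runnable. The idea is to number each run of identical font_name and font_size, build the string within each run, and keep the last observation of the run.

```stata
* Toy data (hypothetical values, for illustration only)
clear
input str12 text str8 font_name double font_size long obs_number
"List"    "Bold"    10 1
"of"      "Bold"    10 2
"values"  "Bold"    10 3
"Missing" "Regular" 10 4
"data"    "Regular" 10 5
"1"       "Regular"  9 6
"Yes"     "Regular" 10 7
end

* Assumption: obs_number reflects the reading order of the words
sort obs_number

* block increases by 1 each time font_name or font_size changes
gen long block = sum(font_name != font_name[_n - 1] | font_size != font_size[_n - 1])

* Within each block, append each word to the string accumulated so far
bysort block (obs_number): replace text = text[_n - 1] + " " + text if _n > 1

* Keep the last observation of each block, which holds the complete string
bysort block (obs_number): keep if _n == _N
drop block

list text font_name font_size
```

Note that the other variables (width, height, x, y, obs_number) then come from the last word of each block. If the values of the first word are wanted instead, one option would be to build the string backwards, or to copy those variables from the first observation of the block before dropping.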