1
IRF Symposium 2007 8th and 9th November - Vienna
OCR Errors in Patent Full-Text Documents
Perspective of an information professional
2
Searching in full-text patents
Many requests related to pharmaceutical R&D
include one or more of the following topics:
  • Compounds / Drugs
  • Drug actions
  • Indications
  • Formulations

What kind of errors do we have to deal with when
searching or mining for these aspects?
3
Searching in full-text patents
MicroPatent PatSearch was used for all of the
following searches (years 1981 - now).
Other full-text patent sources like Espacenet,
STN or Patbase have the same type of OCR error
problems!
4
Text Mining approach
In a typical workflow of a thesaurus-based text
mining approach, OCR errors can lead to losses
twice.
Generation of synonyms for search
5
Typical OCR Errors
Examples: l or i or 1?
Variations of alkyl groups:
  • methyi or ethyi or propyi or butyi: 17388 patents!
  • methy1 or ethy1 or propy1 or buty1: 13118 patents!
Variations of emulsion:
  • emuision: 780 patents
  • emulslon: 47 patents
  • emuislon: 3 patents
6
Typical OCR Errors
Examples: rn or m? l or 1 or i?
Variations of micro:
  • rnicro in 5398 patents
  • mlcro in 1004 patents
  • m1cro in 344 patents
2 OCR errors in such a short word are rare:
  • rnlcro in 12 patents
  • rn1cro in 4 patents
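The variants listed on the two "Typical OCR Errors" slides follow a simple substitution pattern, so they can be generated rather than typed by hand. The following is only a minimal sketch: the confusion table is an assumption built from the slide examples (l/i/1 and m/rn), and the function name `ocr_variants` is hypothetical, not part of any search tool.

```python
from itertools import combinations, product

# Assumed confusion sets, taken from the slide examples only.
CONFUSIONS = {
    "l": ["i", "1"],
    "i": ["l", "1"],
    "m": ["rn"],
}


def ocr_variants(term, max_errors=2):
    """Return spellings of `term` containing 1..max_errors character confusions."""
    positions = [i for i, ch in enumerate(term) if ch in CONFUSIONS]
    variants = set()
    for n in range(1, max_errors + 1):
        for chosen in combinations(positions, n):
            options = [CONFUSIONS[term[i]] for i in chosen]
            for replacement in product(*options):
                chars = list(term)
                for i, new in zip(chosen, replacement):
                    chars[i] = new  # may insert two characters ("rn")
                variants.add("".join(chars))
    return variants


# Build an "or" query string in the style used on the slides.
query = " or ".join(sorted(ocr_variants("micro", max_errors=1)))
```

Limiting `max_errors` keeps the list short; as the patent counts above suggest, double errors in a short word are rare, so most of the gain comes from single substitutions.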
7
OCR Errors Formulations
Some variations of microemulsion
8
OCR Errors Compound Names
Searching full-text patents (WO, EP, US, FR, GB,
DE, JP) for the term Simvastatin yields 9030
patents (3666 INPADOC families).
9
OCR Errors Chemical Names
If you think that was bad... look at the IUPAC
names
10
OCR Errors Chemical Names
In 141 patents containing the IUPAC name of
Simvastatin, not one (!) contained the correct
name:
6(R)-2-8(S)-(2,2-dimethylbutyryloxy)-2(S),6(R)-dimethyl-1,2,6,7,8,8a(R)-hexahydronaphthyl-1(S)ethyl-4(R)-hydroxy-3,4,5,6-tetrahydro-2H-pyran-2-one
After removing all characters which are not a
letter or number:
6R28S22dimethylbutyryloxy2S6Rdimethyl126788aRhexahydronaphthyl1Sethyl4Rhydroxy3456tetrahydro2Hpyran2one
13 out of 141 patents were found...
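The normalization step described on this slide (dropping every character that is not a letter or number before comparing) can be sketched in a few lines. This is an illustrative sketch only; the function names are hypothetical, and the name string is the one shown in this transcript.

```python
import re


def strip_non_alphanumeric(text):
    """Remove every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", text)


def contains_name(document_text, name):
    """True if the normalized name occurs in the normalized document text."""
    return strip_non_alphanumeric(name) in strip_non_alphanumeric(document_text)


# IUPAC name as it appears in this transcript.
SIMVASTATIN = (
    "6(R)-2-8(S)-(2,2-dimethylbutyryloxy)-2(S),6(R)-dimethyl-"
    "1,2,6,7,8,8a(R)-hexahydronaphthyl-1(S)ethyl-4(R)-hydroxy-"
    "3,4,5,6-tetrahydro-2H-pyran-2-one"
)
normalized = strip_non_alphanumeric(SIMVASTATIN)
```

Normalization only repairs punctuation, bracket and spacing damage. Misread letters survive it, which is consistent with only 13 of the 141 patents being found.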
11
OCR Errors Chemical Names
Searching for (long) IUPAC names in full-text
patents will miss most hits.
This is very relevant for all applications which
convert IUPAC names into chemical structures!
Nevertheless, searching for brand names or
generic names will certainly find additional
relevant hits, especially as these names are often
mentioned several times in a document.
12
OCR Errors Drug Action
Found variations of Angiotensin II antagonists
Even very short fragments like the Roman
numeral II can cause a lot of trouble!
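One way to tolerate such damage in a short fragment is to match look-alike characters as a class instead of listing every variant. The sketch below assumes the I/l/1/i confusion named elsewhere in this presentation; the helper `tolerant_pattern` and the sample strings are hypothetical illustrations, not the variations actually found in the patents.

```python
import re

# Assumed set of characters that OCR confuses with one another.
LOOKALIKES = "Il1i"


def tolerant_pattern(phrase):
    """Compile a regex where each look-alike character matches the whole class."""
    parts = []
    for ch in phrase:
        if ch in LOOKALIKES:
            parts.append(f"[{LOOKALIKES}]")
        elif ch.isspace():
            parts.append(r"\s+")  # tolerate variable spacing between words
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = tolerant_pattern("Angiotensin II antagonist")
hits = [s for s in ("Angiotensin 11 antagonist",
                    "Angiotensin ll antagonist",
                    "Angiotens1n Il antagonist",
                    "Angiotensin I antagonist")
        if pattern.search(s)]
```

A character class keeps the query compact, but it only helps where the search system supports regular expressions or equivalent wildcards.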
13
Transposed Characters
Some errors cannot originate from an erroneous
OCR process. Accidentally transposed characters
are another source of variations:
  • ehtyl: 1565 patents
  • mehtyl: 840 patents
  • compuond: 231 patents
  • relaese: 44 patents
  • formual: 1689 patents
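Transposition variants like the ones listed here can be enumerated mechanically by swapping each pair of neighbouring characters. A minimal sketch, with a hypothetical function name:

```python
def adjacent_transpositions(term):
    """Return all spellings obtained by swapping two neighbouring characters."""
    variants = set()
    for i in range(len(term) - 1):
        if term[i] != term[i + 1]:  # swapping identical letters changes nothing
            variants.add(term[:i] + term[i + 1] + term[i] + term[i + 2:])
    return variants


ethyl_variants = adjacent_transpositions("ethyl")
```

A word of n letters yields at most n - 1 variants, so adding them to a query is cheap, although many of them will never occur in practice.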
14
Wrong Names / Orthography
Many errors are the result of bad spelling or
lack of knowledge of the correct name /
orthography:
  • naphtyl: 11206 patents
  • napthyl: 11276 patents
  • esther: 1387 patents
15
Wrong Names / Orthography
Sepracor INC used the name Sildenophil 64 times
(in 18 patents) without once mentioning the
correct name Sildenafil.
US6974837 B2: Compositions comprising sibutramine
metabolites in combination with phosphodiesterase
inhibitors (SEPRACOR INC)
"....Particular phosphodiesterase inhibitors
include, but are not limited to, sildenophil
(Viagra), desmethylsildenophil, vinopocetine,
milrinone..."
16
Missing Space Characters
Missing space characters can easily cause losses
Example: drug action analyses of pharmaceutical
patents
An extraction based on rules like
  • target1 with agonist
  • target2 with agonist
  • target3 with agonist
  • etc....
will miss those hits which have no space character
between the target name and the term agonist:
  • PDE 4agonist
  • Adenosin A2agonist
Left truncation is not very helpful: a
left-truncated agonist would also yield the
antagonists!
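Where the extraction rules can be expressed as regular expressions, making the whitespace optional catches the fused forms without resorting to left truncation. This is a sketch under that assumption; the function names are hypothetical, and the target names are the two examples from the slide.

```python
import re


def agonist_pattern(target):
    """Match `target` followed by 'agonist(s)', with or without spaces in between."""
    parts = [re.escape(p) for p in target.split()]
    # Whitespace inside the target and before 'agonist' is optional.
    return re.compile(r"\b" + r"\s*".join(parts) + r"\s*agonists?\b", re.IGNORECASE)


def find_agonist_hits(text, targets):
    """Return the targets for which an agonist mention is found in `text`."""
    return [t for t in targets if agonist_pattern(t).search(text)]


targets = ["PDE 4", "Adenosin A2"]
found = find_agonist_hits("a PDE 4agonist and an Adenosin A2agonist", targets)
missed = find_agonist_hits("a PDE 4 antagonist", targets)
```

Because the pattern requires "agonist" to start directly after the target (ignoring whitespace), "antagonist" is not matched, which avoids the problem of left truncation.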
17
Scrambled Tables
Original
18
Error Types
Common error types interfering with searching /
text mining
  • OCR letter misinterpretation: I-1-l (methy1),
    m-rn (rnicro)
  • Typos: mehtyl or relaese or compuond
  • Intentional errors or lack of knowledge:
    Sildenophil
  • Spacing errors: ...agonists
  • OCR misinterpretation of text areas
  • inclusion of line numbers into phrases
  • scrambled table structures
  • inclusion of characters from chemical structures
    into phrases

19
Conclusions
What have we learned?
  • All patent full-text databases contain (lots of)
    OCR errors
  • Only some of the errors are common/systematic
    enough to be included in searches or text mining
    approaches
  • Numerous errors are so severe and unpredictable
    that they can only be corrected manually
  • Even documents not created via OCR regularly
    contain errors

Quality of future OCR documents will improve, but
re-scanning of the huge backfile is unrealistic.
Smart error correction algorithms and reference
lists can help, but good solutions for efficient
manual scanning are very important too!
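A reference list in its simplest form is a table mapping known wrong spellings to the intended term, which can then be used to expand a query. The sketch below is only an illustration: the wrong spellings are those reported in this presentation, while the corrected forms (apart from Sildenafil) and all function names are assumptions.

```python
from collections import defaultdict

# Wrong spelling -> intended term. Corrected forms are assumed, except sildenafil.
KNOWN_VARIANTS = {
    "sildenophil": "sildenafil",
    "naphtyl": "naphthyl",
    "napthyl": "naphthyl",
    "esther": "ester",
    "ehtyl": "ethyl",
    "mehtyl": "methyl",
    "compuond": "compound",
    "relaese": "release",
    "formual": "formula",
}


def build_expansions(variant_to_canonical):
    """Invert the reference list: intended term -> set of known wrong spellings."""
    expansions = defaultdict(set)
    for variant, canonical in variant_to_canonical.items():
        expansions[canonical].add(variant)
    return expansions


def expand_query(term, expansions):
    """Return the term followed by its known wrong spellings, sorted."""
    key = term.lower()
    return [key] + sorted(expansions.get(key, ()))


expansions = build_expansions(KNOWN_VARIANTS)
naphthyl_terms = expand_query("Naphthyl", expansions)
```

Such a list only covers errors that someone has already spotted, which fits the conclusion above that manual checking remains necessary.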
20
Thank you!