OCR Correction for Corpus-assisted Discourse Studies: a Case Study of Old Newspapers
Dario Del Fante
2021
Abstract
Using OCR software to convert printed characters into digital text is a fundamental step in diachronic approaches to Corpus-Assisted Discourse Studies. However, OCR software is not fully accurate, and the resulting error rate may compromise the qualitative analysis carried out in such studies. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. We present a case study on early 20th-century newspapers, compiled for the linguistic analysis of the metaphors used to represent immigrants.

Files in this item:
File | Access | Description | Type | License | Size | Format |
---|---|---|---|---|---|---|
AIUCD 2021 - Del Fante.pdf | Open access | Publisher's version | Full text (publisher's version) | Creative Commons | 660.13 kB | Adobe PDF |
Documents in SFERA are protected by copyright and all rights are reserved, unless otherwise indicated.