Patrick Juola - PowerPoint PPT Presentation

About This Presentation
Title:

Patrick Juola

Description:

Authorship Attribution: What Mixture-of-Experts Says We Don't Yet Know. Inference of traits of author (identity, gender, socioeconomics, age, &c) from writings ...

Transcript and Presenter's Notes

Title: Patrick Juola


1
Authorship Attribution: What Mixture-of-Experts
Says We Don't Yet Know
  • Patrick Juola
  • Duquesne University
  • Juola_at_mathcs.duq.edu

2
Authorship Attribution
  • Inference of traits of author (identity, gender,
    socioeconomics, age, &c) from writings
  • Wide application
  • Linguistics
  • History
  • Literature
  • Forensics
  • &c.

3
Authorship Attribution
  • Traditionally done by close reading
  • Current research spurt in corpus-based and
    statistical (nontraditional) methods
  • E.g., author 2 uses more prepositions than
    author 6
  • Effective, but theoretically ill-founded
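A minimal sketch of the kind of statistic the preposition example refers to: the share of word tokens that are prepositions. The word list, the tokenizer, and the name `preposition_rate` are illustrative assumptions, not taken from the talk; a real study would use a part-of-speech tagger, since a word like "to" is not always a preposition.

```python
import re

# Small illustrative preposition list (an assumption, not a standard inventory).
PREPOSITIONS = {"of", "in", "to", "for", "with", "on", "at", "by", "from", "about"}


def preposition_rate(text: str) -> float:
    """Return the fraction of word tokens in `text` that appear in PREPOSITIONS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in PREPOSITIONS)
    return hits / len(tokens)


# Two made-up samples standing in for two authors.
sample_a = "The letter from the clerk of the court arrived at noon."
sample_b = "She laughed, turned, and walked away quickly."

if __name__ == "__main__":
    print(preposition_rate(sample_a), preposition_rate(sample_b))
```

Comparing such rates across authors gives a usable signal, but it does not say why the rates differ, which is the "theoretically ill-founded" part.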

4
A realist's view
  • Improved corpora, improved computers, active
    community yield practical results
  • Need
  • Standardization of methods
  • Standardization of problems
  • Standardization of testbeds
  • History of successful applications

5
A realist's view (cont.)
  • Many methods work
  • Which ones work the best?
  • Many different applications
  • Which applications will have impact?
  • Need to tune tests, applications
  • Need for explanatory capacity.
  • A detective novel is unsatisfying if you don't
    understand the sleuth

6
Stylome (van Halteren)
  • Handwriting analysis -- assumes people have
    persistent, uncontrollable habits of penmanship
  • Authorship attribution -- assumes people have
    persistent, uncontrollable habits of thought
    and/or phrasing
  • This assumption needs testing, specification,
    and validation

7
AAAC
  • Ad-hoc Authorship Attribution Contest (Juola,
    ACH/ALLC 2004)
  • June 2004, Gothenburg, Sweden
  • Compare professional ad-hocracy of methods
    using controlled corpus
  • Establish set of effective methods
  • 13 problems in various languages and genres (see
    Juola, 2006; Juola, 2008 for details)

8
Results (1)
  • Problems admittedly difficult
  • Problem F (Paston letters) -- Middle English
    letters, <1000 words
  • Problems A, M (essays) -- fixed topics, homogeneous
    writers, <1500 words
  • Results nevertheless very good
  • (Schler) 71%, (Keselj) 69%, (Juola) 65%
  • Simple lexical statistics not very good

9
Results (2)
  • Language independent
  • English/non-English correlation 0.5938
    (p < 0.05)
  • May be size-independent
  • Large/small problem correlation 0.3141
  • Good algorithms seem to trump
  • Lesson: if you can't get 90% on the Paston
    letters, your algorithm isn't good enough
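The slide does not say which correlation coefficient was used. Assuming Pearson's r, computed over per-participant scores on two groups of problems, a sketch looks like this; the function name and the toy inputs are illustrative and are not contest data.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values only: each position is one hypothetical participant.
    print(pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

A high value would mean that methods scoring well on one group of problems also tend to score well on the other.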

10
Results reanalysis
  • Top 5 finishers (out of 1300)
  • Koppel/Schler: 918
  • Keselj/Cercone: 897
  • van Halteren: 861
  • Juola: 851
  • Coburn: 804
  • 3 MoE: 914
  • 5 MoE: 924
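The slides do not say how the mixture-of-experts (MoE) answers were combined. A minimal sketch, assuming simple plurality voting over each expert's predicted author, with ties going to the expert listed first (for example, the highest-ranked one). The function name and the labels are hypothetical.

```python
from collections import Counter


def mixture_of_experts(predictions: list[str]) -> str:
    """Plurality vote over expert predictions.

    `predictions` is ordered by expert rank; on a tie, the label proposed
    by the earliest-listed expert among the tied labels wins.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable: some label always has the top count")


if __name__ == "__main__":
    # Three hypothetical experts voting on one test document.
    print(mixture_of_experts(["author_2", "author_6", "author_2"]))
```

Under this scheme a combination can only beat its best member when the experts make partly independent errors, which is the question the next slides raise.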

11
Methods
  • Top 5 finishers (out of 1300)
  • K/S: unstable words/SVM
  • K/C: byte n-grams/k-NN
  • vH: word tokens/profiling
  • Juola: characters/cross-entropy
  • Coburn: word n-grams/graph cuts
  • 3 MoE: ???
  • 5 MoE: ?????
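Each entry above pairs an event set with a statistic. As one illustration of that pairing, here is a sketch using character trigrams as events and cross-entropy under an add-one-smoothed frequency model as the statistic. It is a simplified stand-in, not the method any contestant actually used; all names and texts are hypothetical.

```python
import math
from collections import Counter


def ngrams(text: str, n: int = 3) -> list[str]:
    """All overlapping character n-grams in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(text: str, n: int = 3) -> Counter:
    """An author 'model' is just the n-gram counts of known writing."""
    return Counter(ngrams(text, n))


def cross_entropy(model: Counter, text: str, n: int = 3) -> float:
    """Average bits per n-gram of `text` under the add-one-smoothed model."""
    grams = ngrams(text, n)
    if not grams:
        raise ValueError("text is shorter than n")
    total = sum(model.values())
    vocab = len(model) + 1  # one extra slot shared by all unseen n-grams
    bits = sum(-math.log2((model[g] + 1) / (total + vocab)) for g in grams)
    return bits / len(grams)


def attribute(candidates: dict[str, str], unknown: str, n: int = 3) -> str:
    """Return the candidate whose model gives `unknown` the lowest cross-entropy."""
    return min(
        candidates,
        key=lambda author: cross_entropy(train(candidates[author], n), unknown, n),
    )


if __name__ == "__main__":
    known = {
        "a": "the cat sat on the mat " * 5,
        "b": "zzz qqq xxx yyy www " * 5,
    }
    print(attribute(known, "the cat on the mat"))
```

Swapping `ngrams` for a word tokenizer, or the cross-entropy for a distance between profiles, gives other event-set/statistic pairings of the kind listed on this slide.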

12
What are we measuring?
  • Different event sets
  • Different statistics
  • Adding more makes results more accurate
  • ... but we're all measuring similar things
  • What's the real thing we should be measuring?
    Where's our theory?
  • Can we mix-and-match to find best approach?

13
JGAAP
  • Java Graphical Authorship Attribution Processor
  • Java-based software incorporating three-phase
    framework, with selectable phase processors
  • Extensible and distributable (demo available
    afterwards)
  • How to make more accessible?
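The slide names a three-phase framework with selectable processors but does not list the phases. The sketch below assumes they are text canonicization, event extraction, and statistical analysis. It is written in Python for brevity and does not reproduce JGAAP's actual Java API; every name in it is hypothetical.

```python
from collections import Counter
from typing import Callable

Canonicizer = Callable[[str], str]
EventDriver = Callable[[str], list[str]]
Analyzer = Callable[[dict[str, Counter], Counter], str]


def lowercase(text: str) -> str:
    """Phase 1 option: fold case."""
    return text.lower()


def word_events(text: str) -> list[str]:
    """Phase 2 option: whitespace-separated words."""
    return text.split()


def char_bigram_events(text: str) -> list[str]:
    """Phase 2 option: overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def overlap_analyzer(known: dict[str, Counter], unknown: Counter) -> str:
    """Phase 3 option: author whose histogram shares the most events with `unknown`."""
    return max(known, key=lambda author: sum((known[author] & unknown).values()))


def run_pipeline(canonicize: Canonicizer, events: EventDriver, analyze: Analyzer,
                 training: dict[str, str], unknown: str) -> str:
    """Apply the same phase 1 and 2 choices to every text, then analyze."""
    known = {a: Counter(events(canonicize(t))) for a, t in training.items()}
    return analyze(known, Counter(events(canonicize(unknown))))


if __name__ == "__main__":
    training = {"x": "The quick brown fox", "y": "Lorem ipsum dolor sit"}
    print(run_pipeline(lowercase, word_events, overlap_analyzer,
                       training, "the QUICK fox"))
```

Because each phase is just a function argument, mixing and matching (for example, swapping `word_events` for `char_bigram_events`) needs no change to the rest of the pipeline. The raw overlap count favors longer training texts, so a real analyzer would normalize.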

14
An idealist's view
  • Tie authorship attribution features to features
    of author or authorship process
  • Be able to speak not just of authorship, but
    about author.
  • Can we justify linguistic aspects in terms of
    bio/cognitive features?

15
Linguistic diagnostics
  • Schizophrenic language is atypical.
  • E.g. altered idea density.
  • Early-onset Alzheimer's, pugilistic aphasia
  • Altered sentence complexity
  • Medical use as diagnostic criterion
  • Hemingway's authorial traits?

16
Lessons
  • Authorship attribution is possible, but
    theoretically ill-grounded
  • Lots of ways to measure almost the same thing:
    too much repetition
  • Too many different ways to process
  • Not clear what the important underlying factors
    are
  • Need theoretical basis

17
Some advertisements
  • Hey, if Harald can do it, so can I!
  • JGAAP: www.jgaap.com
  • JGAAP Wiki: same
  • Juola, Patrick. (2008). Authorship Attribution.
    Forthcoming from NOW Publishing as part of
    Foundations and Trends in Information Retrieval.