introducing nora: poetry, pattern recognition, and provocation



1
introducing nora: poetry, pattern recognition,
and provocation
  • Matthew Kirschenbaum
  • University of Maryland
  • mgk@umd.edu

2
the nora project
  • Funded for two years ($600K) by the Andrew W.
    Mellon Foundation
  • We're one year in
  • PI John Unsworth (Dean, GSLIS, Illinois)
  • Participating researchers from Illinois (GSLIS,
    NCSA), Georgia (English), Maryland (MITH, HCIL,
    English), Virginia (IATH, CS, English), Alberta
    (Humanities Computing)
  • Fields of expertise include literary studies,
    library science, computer science
  • www.noraproject.org

3
technologies and heuristics
  • Digital libraries
  • Text mining
  • Machine learning
  • Iteration and play
  • Visualization
  • Provocation and anomaly

4
old school
  • According to Aristotle's commendable formula . .
    . the beautiful is defined as that which the eye
    can easily embrace in its entirety and which can
    be surveyed as a whole. The tragedy of King
    Oedipus may well arouse pity and fear, but
    according to The Poetics it is beautiful because
    it fulfills the temporalized optical requirement
    of having a beginning, a middle, and an end.
    Perception of its form is not resisted by
    boundlessness. Thus, long before Baumgarten's
    modern foundation of the concept and the subject
    matter of Aesthetics, and longer still before the
    term arose which will have guided my commentary,
    Aesthetics begins as "pattern recognition."
  • --Friedrich Kittler, "The World of the
    Symbolic--A World of the Machine"

5
digital libraries
  • A lot of us have spent the last 10 to 20 years
    digitizing large piles of texts and other media
  • We can browse these online archives and
    collections, and search them in reasonably useful
    ways. But what else can we do?
  • nora starts with 5 GB of 18th and 19th century
    British and American literature contributed from
    about a dozen different repositories and
    collections

6
text mining
  • "The semi-automated discovery of trends and
    patterns across very large datasets" (Hearst 3).
  • Don Swanson's association of magnesium deficiency
    with migraine headaches by mining bio-medical
    literature in the 1980s
  • "A new, potentially plausible medical hypothesis
    was derived from a combination of text fragments
    and the explorer's medical expertise" (Hearst 6).
  • Everyday uses of this:
  • Classification
  • Matching
  • Clustering
  • Prediction

7
classification (sorting)
8
matching (More like this . . .)
9
clustering
10
prediction (machine learning)
11
i. emily dickinson, hot or not?
  • Prediction: can the machine be taught to
    identify patterns of erotic language in the
    corpus of correspondence (some 200 letters)
    exchanged between Emily Dickinson and Susan
    Huntington Dickinson (Emily's sister-in-law)?
  • We remind her we love her - Unimportant fact,
    though Dante did'nt think so, nor Swift, nor
    Mirabeau.

12
what we did
  • Two members of the project team with expertise in
    Dickinson's writings, Martha Nell Smith and Tanya
    Clement, sat down with the corpus and labeled
    each text "hot" or "not." Yes, this was to an
    extent arbitrary.
  • These evaluations were passed to a data mining
    expert (Bei Yu) who subjected the corpus to a
    kind of predictive analysis known as Naïve
    Bayesian classification.

13
the black box
  • Bayes is Thomas Bayes (1702-1761), the British
    mathematician and minister
  • Bayesian probability is the domain of probability
    that deals with non-quantifiable events: not
    whether a coin will land heads or tails, for
    instance, but rather the percentage of people who
    believe the coin might land on its side (also
    known as subjective probability)
  • Our Bayesian classification is naïve because it
    deliberately does not consider relationships and
    dependencies between words we might instinctively
    think go together (kiss and lips, for example).
    The algorithm merely establishes the presence or
    absence of one or more words, and takes their
    presence or absence into account when assigning a
    probability value to the overall text. This is
    the kind of thing computers are very good at, and
    naïve Bayes has proven surprisingly reliable in a
    number of different text classification domains.
  • The math for this is
    P(C|D) = P(C) × ∏_i P(w_i|C) / P(D)
  • Which is the probability that a given document D
    belongs to a given class C, given the words w_i
    that D contains.
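  • A minimal sketch of this kind of naïve Bayes
    scoring (in Python, with toy texts and
    hypothetical labels; not the project's actual
    code):

    from collections import Counter, defaultdict
    import math

    # Toy hand-labeled training data (hypothetical stand-ins for the letters).
    training = [
        ("hot", "if you were here I would kiss you and hold you"),
        ("not", "please remember to return the book next week"),
    ]

    class_counts = Counter(label for label, _ in training)
    word_counts = defaultdict(Counter)
    for label, text in training:
        word_counts[label].update(text.lower().split())

    vocabulary = {w for counts in word_counts.values() for w in counts}

    def classify(text, classes=("hot", "not"), alpha=1.0):
        """Pick the class maximizing log P(C) + sum of log P(w|C) over the words."""
        scores = {}
        for c in classes:
            total = sum(word_counts[c].values())
            score = math.log(class_counts[c] / len(training))
            for w in text.lower().split():
                # Laplace smoothing keeps unseen words from zeroing out the product.
                score += math.log((word_counts[c][w] + alpha) /
                                  (total + alpha * len(vocabulary)))
            scores[c] = score
        return max(scores, key=scores.get)

    print(classify("I would kiss you"))   # "hot" for this toy data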

14
raw output
  • Some 4,000 features (discrete word types)
    extracted and ranked; a sample of the output:

    CLASS_PROB_Hot          CLASS_PROB_Not          RATIO                 FEATURE
    (double)                (double)                (double)              (String)
    7.140671223094971E-4    7.871536523929471E-5    2.2051385922805053    mine
    7.140671223094971E-4    7.871536523929471E-5    2.2051385922805053    must
    6.120575334081404E-4    7.871536523929471E-5    2.0509879124532464    Bud
    6.120575334081404E-4    7.871536523929471E-5    2.0509879124532464    Woman
    6.120575334081404E-4    7.871536523929471E-5    2.0509879124532464    Vinnie
    6.120575334081404E-4    7.871536523929471E-5    2.0509879124532464    joy
    6.120575334081404E-4    7.871536523929471E-5    2.0509879124532464    Thee
    5.100479445067837E-4    7.871536523929471E-5    1.8686663556592924    write
    5.100479445067837E-4    7.871536523929471E-5    1.8686663556592924    Eden
    5.100479445067837E-4    7.871536523929471E-5    1.8686663556592924    luxury
    5.100479445067837E-4    7.871536523929471E-5    1.8686663556592924    Alps
    5.100479445067837E-4    7.871536523929471E-5    1.8686663556592924    weary
    5.100479445067837E-4    7.871536523929471E-5    1.8686663556592924    sick
    5.100479445067837E-4    7.871536523929471E-5    1.8686663556592924    Hour
    5.100479445067837E-4    7.871536523929471E-5    1.8686663556592924    Dream
    0.009384882178924818    0.0014955919395465995   1.8365780411077903    you
    0.001836172600224421    3.1486146095717883E-4   1.7633058400014656    Susie
    4.080383556054269E-4    7.871536523929471E-5    1.6455228043450818    remember
    4.080383556054269E-4    7.871536523929471E-5    1.6455228043450818    garden
    4.080383556054269E-4    7.871536523929471E-5    1.6455228043450818    dream
    4.080383556054269E-4    7.871536523929471E-5    1.6455228043450818    cant
    4.080383556054269E-4    7.871536523929471E-5    1.6455228043450818    always
    4.080383556054269E-4    7.871536523929471E-5    1.6455228043450818    loved
    4.080383556054269E-4    7.871536523929471E-5    1.6455228043450818    lives
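  • The RATIO column appears to be the natural log of
    the Hot/Not probability ratio (an inference from
    the numbers above, not something the project
    states). A sketch, with hypothetical counts, of
    how such a ranked feature list could be produced:

    from collections import Counter
    import math

    def rank_features(word_counts, alpha=1.0):
        """Return rows of (p_hot, p_not, log ratio, word), strongest 'hot' cues first."""
        vocab = set(word_counts["hot"]) | set(word_counts["not"])
        totals = {c: sum(word_counts[c].values()) for c in ("hot", "not")}
        rows = []
        for w in vocab:
            # Smoothed per-class probabilities of the feature (word type).
            p_hot = (word_counts["hot"][w] + alpha) / (totals["hot"] + alpha * len(vocab))
            p_not = (word_counts["not"][w] + alpha) / (totals["not"] + alpha * len(vocab))
            rows.append((p_hot, p_not, math.log(p_hot / p_not), w))
        return sorted(rows, key=lambda r: r[2], reverse=True)

    # Hypothetical counts, for illustration only.
    counts = {"hot": Counter(kiss=3, write=2, garden=1),
              "not": Counter(write=2, weather=4)}
    for p_hot, p_not, ratio, word in rank_features(counts):
        print(f"{p_hot:.3e}  {p_not:.3e}  {ratio:.4f}  {word}")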

15
some results
Slide credit: Bei Yu
16
ii. deduction
  • Can we decide what humanists might want to do
    with text mining technology by using deductive
    methods?
  • Compare verbs from scholarly prose in ALH and ELH
    to usage of these same verbs in the American
    National Corpus (drawn from the NY Times and
    other newspapers)
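  • A sketch of this kind of relative-frequency
    comparison; the toy word lists stand in for the
    ALH/ELH prose and the ANC sample, and the simple
    frequency ratio is an assumption, not the
    project's actual measure:

    from collections import Counter

    def relative_freq(tokens):
        """Map each word to its share of the corpus."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    # Hypothetical token lists standing in for the two corpora.
    scholarly = "totalizing privileging misreading mediating reading reading".split()
    newspaper = "reading reporting writing reading reading".split()

    sch, news = relative_freq(scholarly), relative_freq(newspaper)
    for verb, freq in sorted(sch.items(), key=lambda kv: kv[1], reverse=True):
        ratio = freq / news.get(verb, 1e-6)   # floor avoids division by zero
        print(f"{verb}: {ratio:.1f}x more frequent in the scholarly sample")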

17
holding up the mirror
  • totalizing
  • reprinting
  • mediating
  • slaveholding
  • privileging
  • absenting
  • misreading
  • desiring
  • seafaring
  • fetishizing
  • What would it mean to teach a machine to
    recognize the way in which a text privileges,
    fetishizes, or totalizes?
  • Wiser heads than ours are working on it: Dominic
    Widdows, Geometry and Meaning (Stanford: CSLI
    Publications, 2004).

18
iii. it always ends in tears: reading sentiment
  • Can we teach a machine to recognize and read
    sentimental literature?
  • Stage 1. Evaluate the use of text mining on a
    small set of "core" sentimental novels. We will
    label a subset of the chapters (the training set)
    with a score indicating a level of
    sentimentalism, and then see how text mining
    classifies the remaining chapters from those
    novels. (A sketch of this step follows the list.)
    (Texts: Charlotte, Uncle Tom's Cabin, Incidents
    in the Life of a Slave Girl.)
  • Stage 2. Two more novels will be added to the set
    to evaluate how well the processes work when more
    texts by the same authors are added to the set of
    works studied. (Texts added: Charlotte's
    Daughter, The Minister's Wooing.)
  • Stage 3. The texts added will be those that
    scholars recognize as exhibiting sentimentalism,
    though some may not be as consistently
    sentimental chapter-by-chapter as the "core" set
    used earlier. In this experiment the focus will
    be more on gaining insights into sentimentalism
    and these novels than in previous experiments.
    (Some likely texts: Clotelle, The Lamplighter,
    The Coquette, Hobomok.)
  • Stage 4. Using a text-mining model that was
    developed to identify chapters with strong
    sentimentalism, apply text mining to a set of
    works that scholars consider not wholly
    sentimental, or not sentimental at all. This may
    identify parts of texts that contain aspects of
    sentimentalism, or common word use that is
    sentimental in one novel but not another. In this
    experiment there will be a strong focus on
    gaining insights into sentimentalism and these
    novels. (Texts: Moby-Dick, The Scarlet Letter,
    The Blithedale Romance, Irving's Sketchbook,
    Narrative of the Life of Frederick Douglass.)
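  • A minimal sketch of Stage 1: train on a labeled
    subset of chapters, then classify the held-out
    chapters. The library choice (scikit-learn) and
    the placeholder texts are assumptions, not the
    project's actual tooling:

    # Assumes scikit-learn is installed; chapter texts and labels are placeholders.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_chapters = ["her tears fell upon the sleeping child",
                      "the committee reviewed the shipping ledger"]   # hypothetical
    train_labels = ["sentimental", "not"]                             # hand-assigned
    held_out = ["tears and prayers for the lost child"]               # chapters to classify

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_chapters)
    model = MultinomialNB().fit(X_train, train_labels)

    print(model.predict(vectorizer.transform(held_out)))   # predicted label(s)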

19
a general use scenario
  • User begins with a set of documents, obtained
    from one or more federated digital library
    collections.
  • The computer puts up a random list of ten items
    in your subset, and you rank each one, either on
    a yes/no scale or on a 1-5 scale. We warn you
    that if yes/no, less training will be required;
    if 1-5, more training will be required. You mark
    the ones that are like what you're looking for,
    and the ones that are not. Each subsequent
    retrieval of ten items builds on the last,
    inasmuch as it takes your last ranking into
    account as it evaluates semantic distance from
    your best examples (or possibly employs some
    other similarity measure). This positive/negative
    ranking is the basis for the next round....
  • Having been told that a particular set is rich in
    the feature of interest, and another set is weak
    in it, our software brings up iterative guesses
    about what's of interest, and invites the user to
    rank known items.
  • Arriving at what seems like the optimal set of
    "things like what I'm interested in," the user
    pushes a button to ask what things seem to
    predict this set: what are the characteristics
    that are highly correlated with this feature? (A
    sketch of this loop follows.)
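  • A sketch of the iterative ranking loop described
    above, using a simple bag-of-words cosine
    similarity in place of whatever measure the
    actual architecture employs; the documents and
    ratings are hypothetical:

    from collections import Counter
    import math

    def cosine(a, b):
        """Cosine similarity between two bag-of-words Counters."""
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm = (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def next_batch(documents, liked, disliked, k=10):
        """Rank unseen documents by similarity to liked items minus disliked ones."""
        vectors = [Counter(d.lower().split()) for d in documents]
        seen = set(liked) | set(disliked)
        scores = {}
        for i, v in enumerate(vectors):
            if i in seen:
                continue
            scores[i] = (sum(cosine(v, vectors[j]) for j in liked) -
                         sum(cosine(v, vectors[j]) for j in disliked))
        return sorted(scores, key=scores.get, reverse=True)[:k]

    # Hypothetical corpus and a first round of user rankings (list indices).
    docs = ["roses in the garden at dawn", "a quarterly report on grain prices",
            "the garden gate and the summer roses", "minutes of the board meeting"]
    print(next_batch(docs, liked=[0], disliked=[1], k=2))   # likely [2, 3]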

20
the architecture
21
visualization and graphesis
22
provocation and anomaly
  • My machine doesn't just analyze and spit out
    results. All it says is, "Here is strangeness."
    I'm the one who gets to look.
  • --Randall McLeod, working at his collator

23
www.noraproject.org
24
acknowledgements
  • Members of the nora project team whose work
    contributed most directly to the results
    presented here include Loretta Auvil, Tanya
    Clement, Tom Horton, Greg Lord, Catherine
    Plaisant, Steve Ramsay, James Rose, Martha Nell
    Smith, Sara Steger, Kristen Taylor, John
    Unsworth, and Bei Yu

25
the introduction
  • "This is Nora," Stella says, stepping softly
    past Cayce to lay her hands on the shawl-draped
    shoulders of the figure in the chair before the
    screen. Nora's right hand pauses. Still resting
    on the mouse, though Cayce senses this has
    nothing to do with her sister's touch, or the
    presence of a stranger.
  • --William Gibson, Pattern Recognition