Title: Artificial Intelligence and the Internet
1. Artificial Intelligence and the Internet
- Edward Brent, University of Missouri-Columbia and Idea Works, Inc.
- Theodore Carnahan, Idea Works, Inc.
2. Overview
- Objective: consider how AI can be (and in many cases is being) used to enhance and transform social research on the Internet
- Framework: the intersection of AI and research issues
- View: the Internet as a source of data whose size and rate of growth make it important to automate much of the analysis of those data
3. Overview (continued)
- We discuss a leading AI-based approach, the semantic web, and an alternative paradigmatic approach, along with the strengths and weaknesses of each
- We explore how other AI strategies can be used, including intelligent agents, multi-agent systems, expert systems, semantic networks, natural language understanding, genetic algorithms, neural networks, machine learning, and data mining
- We conclude by considering implications for future research
4. Key Features of the Internet
- Decentralized
- Few or no standards for much of the substantive content
- Incredibly diverse information
- Massive and growing rapidly
- Unstructured data
5. The Good News About the Internet
- A massive flow of data
- Digitized
- A researcher's dream
6. The Bad News
- A massive flow of data
- Digitized
- A researcher's nightmare
7. Data Flows
- The Internet provides many examples of data flows.
- A data flow is an ongoing flux of new information, often from multiple sources, and typically large in volume.
- Data flows are the result of ongoing social processes in which information is gathered and/or disseminated by humans for assessment or consumption by others.
- Not all data flows are digital, but all flows on the Internet are.
- Data flows are increasingly available over the Internet.
- Examples of data flows include:
  - News articles
  - Published research articles
  - Email
  - Medical records
  - Personnel records
  - Articles submitted for publication
  - Research proposals
  - Arrest records
  - Birth and death records
8. Data Flows vs. Data Sets
- Data flows are fundamentally different from the data sets with which most social scientists have traditionally worked.
- A data set is a collection of data, often collected for a specific purpose and over a specific period of time, then frozen in place. A data flow, in contrast, is an ongoing flux of new information, with no clear end in sight.
- Data sets typically must be created in research projects funded for that purpose, in which relevant data are collected, formatted, cleaned, stored, and analyzed. Data flows are the result of ongoing social processes in which information is gathered and/or disseminated by humans for assessment or consumption by others.
- Data sets are sometimes analyzed only once in the context of the initial study, but are often made available in data archives to other researchers for further analysis. Data flows often merit continuing analysis, not only of delimited data sets from specific time periods, but as part of ongoing monitoring and control efforts.
9. The Need for Automating Analysis
- Together, the tremendous volume and rate of growth of the Internet and the prevalence of ongoing data flows make automating analysis both more important and more cost-effective.
- Greater cost savings result from automated analysis of very large data sets.
- Ongoing data flows require continuing analysis, which also makes automation cost-effective.
10. AI and Automating Research
- Artificial intelligence strategies offer a number of ways to automate research on the Internet.
11. Contemporary Social Research on the Web
- Formulate the research problem
- Search for and sample web sites containing relevant data
- Process, format, and store data for analysis (a minimal retrieval-and-storage sketch follows this list)
- Develop a coding scheme
- Code web pages for analysis
- Conduct analyses
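A minimal sketch (not from the original slides) of the retrieval-and-storage steps above, using the widely available requests and BeautifulSoup libraries; the URL and output file are hypothetical placeholders.

```python
# A minimal sketch: fetching a web page, extracting its text, and
# storing it for later coding. URL and output path are hypothetical.
import json

import requests
from bs4 import BeautifulSoup

url = "https://example.org/news/article-123"  # hypothetical source page
response = requests.get(url, timeout=30)
response.raise_for_status()

# Strip markup so only the substantive text remains for coding.
soup = BeautifulSoup(response.text, "html.parser")
text = soup.get_text(separator=" ", strip=True)

# Store the raw text with minimal metadata for the analysis step.
record = {"url": url, "text": text}
with open("sampled_pages.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```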
12. Strengths and Weaknesses of the Contemporary Approach
- May use qualitative or quantitative programs to assist with coding and analysis
- Advantages
  - Versatile
  - Gives the researcher much control
- Disadvantages
  - Coding schemes often not shared, requiring more effort and making research less cumulative and less objective
  - Expensive and time-consuming
  - Unlikely to keep up with rapidly changing data in data flows
  - Not cost-effective for ongoing analysis and monitoring
13. The Semantic Web
- The semantic web is an effort to build into the World Wide Web tags or markers for data, along with representations of the semantic meaning of those tags (Berners-Lee and Lassila, 2001; Shadbolt, Hall, and Berners-Lee, 2006).
- The semantic web will make it possible for computer programs to recognize information of a specific type in any of many different locations on the web and to understand the semantic meaning of that information well enough to reason about it.
- This will produce interoperability: the ability of different applications and databases to exchange information and to use that information effectively across applications.
- Such a web can provide an infrastructure to facilitate and enhance many activities, including social science research.
14. Implementing the Semantic Web
Contemporary research element → possible implementation on the semantic web:
- Coding scheme → XML Schema: a standardized set of XML tags used to mark up web pages. For example, research proposals might include tags such as <design>, <sampling plan>, <hypothesis>, and <findings>.
- Coded data → Web pages marked up with XML (Extensible Markup Language): a general-purpose markup language designed to be readable by humans while providing metadata tags for various kinds of substantive content that can be easily recognized by computers.
- Knowledge representation → Resource Description Framework (RDF): a general model for expressing knowledge as subject-predicate-object statements about resources. A sampling plan in a research proposal might include statements such as "systematic sampling - is a - sampling procedure" and "sampling procedure - is part of - a sampling plan" (a short code sketch follows this list).
- Theory → Ontology: a knowledgebase of objects, classes of objects, attributes describing those objects, and relationships among objects. An ontology is essentially a formal representation of a theory.
- Analysis → Intelligent agents: software programs capable of navigating to relevant web pages and using information accessible through the semantic web to perform useful functions.
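To make the RDF row concrete, here is a minimal sketch (ours, not from the original slides) expressing the sampling-plan statements above as subject-predicate-object triples with the rdflib Python library; the namespace URI and term names are hypothetical.

```python
# A minimal sketch of RDF subject-predicate-object statements via rdflib.
# The namespace and term names are hypothetical illustrations.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/research#")  # hypothetical vocabulary

g = Graph()
g.bind("ex", EX)

# "Systematic sampling - is a - sampling procedure"
g.add((EX.SystematicSampling, RDF.type, EX.SamplingProcedure))
# "Sampling procedure - is part of - a sampling plan"
g.add((EX.SamplingProcedure, EX.isPartOf, EX.SamplingPlan))

# Serialize so both humans and programs can read the statements.
print(g.serialize(format="turtle"))
```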
15. The Semantic Web: What Can It Do?
16. AI Strategies and the Semantic Web
- Several components of the semantic web make use of artificial intelligence (AI) strategies:
- Knowledge representation → Object-Attribute-Value (O-A-V) triplets, commonly used in semantic networks
- Theory → Semantic network
- Analysis → Intelligent agents; expert systems; multi-agent models; distributed computing, parallel processing, and grid computing
17. Strengths of the Semantic Web
- Fast and efficient to develop
  - Most coding is done by web developers one time and used by everyone
- Fast and efficient to use
  - Intelligent agents can do most of the work with little human intervention
  - The structure provided makes it easier for computers to process
  - Can take advantage of distributed processing and grid computing
- Interoperability
  - Many different applications can access and use information from throughout the web
18. Weaknesses of the Semantic Web (Pragmatic Concerns)
- Seeks to impose standardization on a highly decentralized process of web development
- Requires the cooperation of many if not all developers
- Imposes the double burden of expressing knowledge for humans and for computers
- How will tens of millions of legacy web sites be retrofitted?
- What alternative procedures will be needed for noncompliant web sites?
- Major forms of data on the web are provided by untrained users unlikely to be able to mark up content for the semantic web
  - E.g., blogs, input to online surveys, emails
19. Weaknesses of the Semantic Web (Fundamental Concerns)
- Assumes there is a single ontology that can be used for all web pages and all users (at least within some domain)
  - For example, a standard way to mark up products and prices on commercial web sites could make it possible for intelligent agents to search the Internet for the best price on a particular make and model of car.
- This assumption may be inherently flawed for social research, for two reasons:
- 1) Multiple paradigms: What ontology could code web pages from multiple competing paradigms or world views (Kuhn, 1969)? If reality is socially constructed and beauty is in the eye of the beholder, how can a single ontology represent such diverse views?
- 2) Competing interests: What if the developers of web pages have political or economic interests at odds with those of some viewers of those pages?
20. Multiple Perspectives
- Chomsky's deep structure vs. subtexts
21. Contested Terms
22. Paradigmatic Approach
- We describe an alternative to the semantic web, one that we believe may be more suitable for many social science research applications.
- It recognizes that there may be multiple incompatible views of data.
- Data structure must be imposed on the data dynamically by the researcher as part of the research process (in contrast to the semantic web, which seeks to build an infrastructure of web pages with data structure pre-coded by web developers).
23. Paradigmatic Approach (continued)
- Relies heavily on natural language processing (NLP) strategies to code data.
- NLP capabilities are not already available for many of these research areas and must be developed.
- Those NLP procedures are often developed and refined using machine learning strategies.
- We will compare the paradigmatic approach to traditional research strategies and the semantic web for important research tasks.
24. Example Areas Illustrating the Paradigmatic Approach
- Event analysis in international relations
- Essay grading
- Tracking news reports on social issues or for clients
  - E.g., campaigns, corporations, press agents
- Each of these areas illustrates significant data flows.
- These areas, and programs within them, illustrate elements of the paradigmatic approach; most do not yet employ all the strategies.
25. Essay Grading
- Essay grading programs allow students to submit essays on a computer; a program then examines each essay and computes a score for the student.
- Some of the programs also provide feedback to help students improve.
- These programs are becoming more common for standardized assessment tests and classroom applications.
- Examples of programs:
  - SAGrader
  - E-rater
  - C-rater
  - Intelligent Essay Assessor
  - Criterion
- These programs illustrate large ongoing data flows and generally reflect the paradigmatic approach.
26. Digitizing Data
- Digitizing:
  - Traditional research: data from the Internet are digitized by web page developers; other data must be digitized by the researcher or analyzed manually. This can be a huge hurdle.
  - Semantic web: data digitized by web page developers
  - Paradigmatic approach: data digitized by web page developers
- The first step in any computer analysis is converting relevant data to digital form, where they are expressed as a stream of digits that can be transmitted and manipulated by computers.
- The semantic web and paradigmatic approaches both rely on web page developers to digitize information, giving them a distinct advantage over traditional research, where digitizing data can be a major hurdle.
27. Essay Grading: Digitizing Data
- Digitizing
  - Papers are replaced with digital submissions
  - SAGrader, for example, has students submit their papers over the Internet using standard web browsers
- Digitizing is often still a major hurdle limiting use
  - Access issues
  - Security concerns
28. Data Conversions
- Converted data:
  - Traditional research: digitized data suitable for web delivery, for human interpretation
  - Semantic web: digitized data suitable for web delivery, for human interpretation
  - Paradigmatic approach: digitized data suitable for web delivery and machine interpretation
- Converting:
  - Traditional research: no further data conversions required once digitized by the web page author
  - Semantic web: no further data conversions required once digitized by the web page author
  - Paradigmatic approach: further conversion sometimes required by the researcher (e.g., OCR, speech recognition, handwriting recognition)
29. Essay Grading: Converting Data
- Data conversion
  - Where essays are submitted on paper, optical character recognition (OCR) or handwriting recognition programs must be used to convert them to digitized text (a minimal OCR sketch follows)
  - Standardized testing programs often face this issue
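As an illustrative sketch of this conversion step (not from the original slides, and not any specific testing program's pipeline), the pytesseract wrapper around the Tesseract OCR engine can convert a scanned essay image to digitized text; the file name is hypothetical.

```python
# A minimal OCR sketch: converting a scanned paper essay to digitized text.
# Assumes the Tesseract engine plus the pytesseract and Pillow packages;
# "scanned_essay.png" is a hypothetical input file.
from PIL import Image
import pytesseract

image = Image.open("scanned_essay.png")
text = pytesseract.image_to_string(image)

# The recognized text can now enter the same pipeline as born-digital essays.
print(text)
```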
30. Encoding Data
- Encoding:
  - Traditional research: encoding done by the researcher (often with the aid of qualitative or quantitative programs)
  - Semantic web: each web page developer must encode a small or moderate amount of data (rather than researchers encoding massive amounts of data)
  - Paradigmatic approach: encoding automated using NLP strategies (including statistical, linguistic, rule-based expert system, and combined strategies) and machine learning (unsupervised learning, supervised learning, neural networks, genetic algorithms, data mining); a supervised-learning sketch follows this list
- Coded data:
  - Traditional research: coded data based on a coding rubric
  - Semantic web: XML markup based on a standard ontology; an XML schema indicates the basic structure expected for a web page
  - Paradigmatic approach: XML markup based on the ontology for that paradigm; an XML schema indicates the basic structure expected for a web page
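To illustrate what automated encoding via supervised machine learning can look like, here is a minimal sketch (ours, not from the original slides) using scikit-learn: a classifier is trained on a few hand-coded examples and then applied to uncoded text. The example texts and code labels are hypothetical.

```python
# A minimal supervised-learning sketch for automated coding:
# train on hand-coded examples, then code new text automatically.
# The texts and labels below are hypothetical illustrations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-coded training data: each text carries a researcher-assigned code.
texts = [
    "The city council approved the new housing budget.",
    "Voters protested outside the state capitol.",
    "Quarterly earnings rose on strong consumer demand.",
    "The firm announced layoffs amid falling revenue.",
]
codes = ["politics", "politics", "economy", "economy"]

# Vectorize the text and fit a classifier in one pipeline.
coder = make_pipeline(TfidfVectorizer(), LogisticRegression())
coder.fit(texts, codes)

# Apply the learned coding scheme to new, uncoded documents.
new_docs = ["Parliament debated the proposed election reform."]
print(coder.predict(new_docs))
```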
31. Essay Grading: Coding
- Essay grading programs employ a wide array of strategies for recognizing important features in essays.
- Intelligent Essay Assessor (IEA) employs a purely statistical approach, latent semantic analysis (LSA).
  - LSA treats essays like a bag of words, using a matrix of word frequencies by essays and factor analysis to find an underlying semantic space. It then locates each essay in that space and assesses how closely it matches essays with known scores (a minimal sketch of this idea follows the slide).
- E-rater uses a combination of statistical and linguistic approaches.
  - It uses syntactic, discourse-structure, and content features to predict scores for essays after the program has been trained to match human coders.
- SAGrader uses a strategy that blends linguistic, statistical, and AI approaches.
  - It uses fuzzy logic to detect key concepts in student papers and a semantic network to represent the semantic information that should be present in good essays.
- All of these programs require learning before they can be used to grade essays in a specific domain.
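Below is a minimal sketch of the LSA-style scoring idea described for IEA (our illustration under simplifying assumptions, not IEA's actual implementation): build a word-frequency matrix over scored essays, reduce it to a low-dimensional semantic space with truncated SVD, and score a new essay by its similarity to essays with known scores. All essays and scores are hypothetical.

```python
# A minimal LSA-style scoring sketch (illustrative, not IEA's actual code):
# word-frequency matrix -> low-dimensional semantic space -> similarity
# to essays with known scores. Essays and scores are hypothetical.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

scored_essays = [
    "Social stratification divides society into layered classes.",
    "Stratification ranks people by wealth, power, and prestige.",
    "My summer vacation was fun and the weather was great.",
]
known_scores = np.array([5.0, 4.0, 1.0])  # hypothetical human scores

# Word-frequency matrix, reduced to a small semantic space.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(scored_essays)
svd = TruncatedSVD(n_components=2)
space = svd.fit_transform(X)

# Locate a new essay in the same space and compare it to scored essays.
new_essay = ["Classes rank people by wealth and prestige."]
new_vec = svd.transform(vectorizer.transform(new_essay))
sims = cosine_similarity(new_vec, space)[0]

# Predict a score as a similarity-weighted average of known scores.
weights = np.clip(sims, 0, None)
print(weights @ known_scores / weights.sum())
```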
32. Knowledge
- Theory:
  - Traditional research and semantic web: a single shared world-view or objective reality
  - Paradigmatic approach: multiple paradigms
- Knowledge:
  - Traditional research: a coding scheme implemented with a codebook (often imperfect)
  - Semantic web: an ontology, a knowledgebase developed by web page developers and shared as a standard (implemented with RDF and ontological languages)
  - Paradigmatic approach: multiple ontologies, one for each paradigm, developed by researchers and shared within a paradigm (implemented with RDF and ontological languages)
33. Essay Grading: Knowledge
- Most essay grading programs have very little in the way of a representation of theory or knowledge.
- This is probably because they are often designed specifically for grading essays and are not meant to be used for other purposes requiring theory, such as social science research.
- For example, C-rater emphasizes semantic content in essays, yet it has no representation of semantic content other than as desirable features of the essay.
- The exception is SAGrader.
  - SAGrader employs technologies developed in a qualitative analysis program, Qualrus. Hence, SAGrader uses a semantic network to explicitly represent and reason about knowledge or theory (a small semantic network sketch follows).
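To make the semantic network idea concrete, here is a minimal sketch (ours, not SAGrader's actual representation): concepts are nodes, labeled relationships are edges, and simple reasoning follows relationship links transitively. The concepts and relations are hypothetical.

```python
# A minimal semantic network sketch (illustrative; not SAGrader's code).
# Nodes are concepts; edges are labeled relationships. The concepts and
# relations below are hypothetical examples.
network = {
    ("systematic sampling", "is a"): "sampling procedure",
    ("sampling procedure", "is part of"): "sampling plan",
    ("sampling plan", "is part of"): "research design",
}

def reachable(concept: str, relation: str) -> list[str]:
    """Follow one relation transitively from a starting concept."""
    chain = []
    while (concept, relation) in network:
        concept = network[(concept, relation)]
        chain.append(concept)
    return chain

# Simple reasoning: what is "systematic sampling" part of, directly
# and indirectly, after following its "is a" link?
parent = network[("systematic sampling", "is a")]
print(parent)                           # sampling procedure
print(reachable(parent, "is part of"))  # ['sampling plan', 'research design']
```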
34. Analysis
- Analysis:
  - Traditional research: by hand, perhaps with the help of qualitative or quantitative programs
  - Semantic web: intelligent agents
  - Paradigmatic approach: intelligent agents
- The semantic web and paradigmatic approaches can take similar approaches to analysis.
35. Essay Grading: Analysis
- All programs produce scores, though the precision and complexity of the scores vary.
- Some produce explanations.
- Most of these essay grading programs simply perform a one-time analysis (grading) of papers. However, some, such as SAGrader, provide ongoing monitoring of student performance as students revise and resubmit their papers.
- Since essays presented to the programs are already converted into standard formats and are submitted to a central site for processing, there is no need for the search and retrieval capabilities of intelligent agents.
36. Advantages of the Paradigmatic Approach
- Suitable for multiple-paradigm fields
- Suitable for contested issues
- Does not require as much infrastructure development on the web
- Can be used for new views requiring different codes with little lag time
37. Disadvantages of the Paradigmatic Approach
- Relies heavily on NLP technologies that are still evolving; may not be feasible in some or all circumstances
- Requires extensive machine learning
- Often requires additional data conversion for automated analysis
- Requires individual web pages to be coded once for each paradigm rather than a single time, hence increasing costs (however, automating the coding keeps those costs manageable)
- Current NLP capabilities are limited to problems of restricted scope: rather than general-purpose NLP programs, they are better characterized as special-purpose NLP programs
38. Structured Data
- Structured data: data stored in a computer in a manner that makes it efficient to examine
  - A good data structure does much of the work, making the algorithms required for some kinds of reasoning straightforward, even trivial.
  - Examples of structured data include data stored in spreadsheets, statistical programs, and databases.
- Unstructured data: data stored in a manner that does not make it efficient to examine
  - Examples of unstructured data include newspaper articles, blogs, interview transcripts, and graphics files.
- A structured/unstructured dichotomy is an oversimplification; data well-structured for some purposes may not be well-structured for others:
  - For viewing by humans (e.g., photographs, protected PDF files)
  - For processing by programs (e.g., text, doc, html files)
  - Marked up for analysis (the semantic web)
- A short sketch contrasting structured and unstructured data follows.
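A minimal sketch (ours, with hypothetical data) of why structure matters: the same fact is trivial to query when stored as structured records, but must be extracted by fragile pattern matching when buried in unstructured text.

```python
# A minimal sketch contrasting structured and unstructured data.
# The records and sentence below are hypothetical.
import re

# Structured: the data structure itself makes the query trivial.
records = [
    {"city": "Columbia", "state": "MO", "population": 84531},
    {"city": "St. Louis", "state": "MO", "population": 319294},
]
print(max(records, key=lambda r: r["population"])["city"])

# Unstructured: the same fact must be extracted with pattern matching,
# which breaks as soon as the wording changes.
sentence = "St. Louis, with 319,294 residents, is larger than Columbia."
match = re.search(r"([\w. ]+), with ([\d,]+) residents", sentence)
if match:
    print(match.group(1), int(match.group(2).replace(",", "")))
```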
39. Event Analysis
40. Event Analysis
- Schrodt's discussion of various coding schemes
41. Discussion and Conclusions
- Both the semantic web and paradigmatic approaches have advantages and disadvantages
- Codes on the semantic web could facilitate coding by paradigmatic-approach programs
- Where there is broad consensus, the single coding for the semantic web could be sufficient
- While the infrastructure for the semantic web is still in development, the paradigmatic approach could facilitate analysis of legacy data
- The paradigmatic approach could be used to build out the infrastructure for the semantic web