Title: Genetic Algorithms for Information Retrieval
Genetic Algorithms for Information Retrieval
- Presented to: Dr. Eyyas Qawasmeh
- Prepared by: Duaa Sawalha (20063173038)
Outline
- 1. Introduction
- 2. Genetic Algorithms in Information Retrieval
- 2.1. Chromosome representation
- 2.2. Fitness evaluation
- 2.3. Selection
- 2.4. Crossover
- 2.5. Mutation
- 3. Suggested steps of GA in IR
- 4. A GA Example
- 5. Conclusion
- 6. References
1. Introduction
[Figure: the information retrieval process, with the labels Docs, Index Terms, doc, Information Need, query, match, and Ranking.]
2. Genetic Algorithms in Information Retrieval
- The GA starts with a limited number of individuals from P (the initial population).
- The iterative search process is based on the competition of these individuals and their descendants over a number of generations.
- The individuals are coded according to the chromosome model as strings of length l.
- The simplest GA constructs a new generation from an old one in three steps: reproduction, crossover, and mutation.
2.1 Chromosome Representation
- A document vector (Doc) with n keywords and a query vector with m query terms can be represented as:
- Doc = (term1, term2, term3, ..., termn)
- Query = (qterm1, qterm2, qterm3, ..., qtermm)
- With a binary term vector, each termi (or qtermj) is either 0 or 1: termi is set to 0 when it is not present in the document and to 1 when it is present.
2.1 Chromosome Representation (cont.)
- For example, suppose a user enters a query into our system and it retrieves 5 documents. These documents are:
- Doc1: Relational Databases, Query, Data Retrieval, Computer Networks, DBMS
- Doc2: Artificial Intelligence, Internet, Indexing, Natural Language Processing
- Doc3: Databases, Expert System, Information Retrieval System, Multimedia
- Doc4: Fuzzy Logic, Neural Network, Computer Networks
- Doc5: Object-Oriented, DBMS, Query, Indexing
2.1 Chromosome Representation (cont.)
- All keywords of these documents can be arranged in ascending (alphabetical) order:
- Artificial Intelligence, Computer Networks, Data Retrieval, Databases, DBMS, Expert System, Fuzzy Logic, Indexing, Information Retrieval System, Internet, Multimedia, Natural Language Processing, Neural Network, Object-Oriented, Query, Relational Databases.
- Each document is then encoded as a chromosome, one bit per keyword:
- Doc1: 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1
- Doc2: 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
- Doc3: 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0
- Doc4: 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0
- Doc5: 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0
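The encoding on this slide can be reproduced with a short sketch. The names `docs`, `vocabulary`, and `encode` are illustrative and do not come from the original material; the keyword order is assumed to be a case-insensitive alphabetical sort, which is what the slide's ordering implies.

```python
# Keyword lists for the five retrieved documents, as given on the slide.
docs = {
    "Doc1": ["Relational Databases", "Query", "Data Retrieval",
             "Computer Networks", "DBMS"],
    "Doc2": ["Artificial Intelligence", "Internet", "Indexing",
             "Natural Language Processing"],
    "Doc3": ["Databases", "Expert System", "Information Retrieval System",
             "Multimedia"],
    "Doc4": ["Fuzzy Logic", "Neural Network", "Computer Networks"],
    "Doc5": ["Object-Oriented", "DBMS", "Query", "Indexing"],
}

# Union of all keywords, sorted case-insensitively (assumption: this is the
# "ascending order" the slide uses, since DBMS follows Databases there).
vocabulary = sorted({kw for kws in docs.values() for kw in kws}, key=str.lower)


def encode(keywords, vocabulary):
    """Return a binary chromosome: 1 where the keyword is present, else 0."""
    present = set(keywords)
    return [1 if term in present else 0 for term in vocabulary]


chromosomes = {name: encode(kws, vocabulary) for name, kws in docs.items()}

for name, bits in chromosomes.items():
    print(name, " ".join(map(str, bits)))
```

Each chromosome has 16 genes, one per distinct keyword, so all documents share the same gene positions and can be compared bit by bit.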
2.2 Fitness Evaluation
- From the GA perspective, fitness means that the genes that are most fit are most likely to survive, while less fit genes die off and are replaced by fitter ones.
- Several functions may be used to determine the fitness and efficacy of a grammar, such as:
- Average Search Length (ASL).
- Average Maximum Parse Length (AMPL).
- The Dice, cosine, and Jaccard coefficients, if the vector space model is used.
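For binary chromosomes, the three vector space coefficients named above reduce to simple counts of shared and total 1-bits. The following is a minimal sketch using the standard definitions; the function names are illustrative, and returning 0.0 for empty vectors is an assumption made here to avoid division by zero.

```python
import math


def _shared(x, y):
    """Number of positions where both binary vectors have a 1."""
    return sum(a & b for a, b in zip(x, y))


def jaccard(x, y):
    """|X and Y| / |X or Y| for binary vectors."""
    either = sum(a | b for a, b in zip(x, y))
    return _shared(x, y) / either if either else 0.0


def dice(x, y):
    """2 * |X and Y| / (|X| + |Y|) for binary vectors."""
    total = sum(x) + sum(y)
    return 2 * _shared(x, y) / total if total else 0.0


def cosine(x, y):
    """|X and Y| / sqrt(|X| * |Y|) for binary vectors."""
    denom = math.sqrt(sum(x) * sum(y))
    return _shared(x, y) / denom if denom else 0.0


# Doc1 and Doc5 from the chromosome representation example.
doc1 = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
doc5 = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]

print(jaccard(doc1, doc5), dice(doc1, doc5), cosine(doc1, doc5))
```

Doc1 and Doc5 share two keywords (DBMS and Query) out of seven distinct ones, so the Jaccard coefficient is 2/7, the Dice coefficient is 4/9, and the cosine coefficient is 2/sqrt(20).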
2.3 Selection
- Selection in genetic inheritance means that the best chromosomes get more copies, the average ones stay even, and the worst die off.
- In genetic algorithms, a new population is selected according to a probability distribution based on the fitness values.
- In information retrieval, many researchers have used the roulette wheel reproduction process.
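Roulette wheel selection gives each chromosome a slice of the wheel proportional to its fitness. The sketch below is one common way to write it, not necessarily the variant used by the researchers mentioned above; `roulette_wheel_select` and its parameters are illustrative names.

```python
import random


def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0, total)  # where the wheel stops
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off


# Usage with illustrative labels and fitness values.
rng = random.Random(0)
population = ["DOC0", "DOC1", "DOC2", "DOC3", "DOC4"]
fitnesses = [0.287744, 0.411692, 0.367556, 0.427473, 0.451212]
new_population = [roulette_wheel_select(population, fitnesses, rng)
                  for _ in range(len(population))]
print(new_population)
```

In practice, `random.choices(population, weights=fitnesses, k=n)` from the standard library draws the same kind of fitness-proportional sample in one call.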
2.4 Crossover
- The intuition behind crossover in the genetic algorithm is to produce new solutions from existing ones. Crossover may be one-point or multiple-point.
- The suggested crossover for IR is multiple-point crossover. High-fitness chromosomes are more likely to be chosen in the crossover process.
- For example, two chromosomes are crossed over between positions 5 and 11:
- 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1
- 1 0 0 1 1 0 0 1 1 1 1 0 0 0 0
- The crossover exchanges the genes in positions 6 through 11 and yields two new chromosomes:
- 1 0 1 1 1 0 0 1 1 1 1 1 1 0 1
- 1 0 0 1 1 1 1 1 0 0 1 0 0 0 0
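The crossover example can be checked with a small two-point crossover sketch. Chromosomes are written as bit strings here for brevity, and `two_point_crossover` is an illustrative name; the cut points 5 and 11 are the ones given on the slide.

```python
def two_point_crossover(parent_a, parent_b, start, end):
    """Exchange the genes after position `start` up to position `end`.

    Positions are counted from 1 as on the slide, so start=5, end=11
    swaps genes 6 through 11 (Python slice [5:11]).
    """
    child_a = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_b = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_a, child_b


parent_a = "101111110011101"
parent_b = "100110011110000"

child_a, child_b = two_point_crossover(parent_a, parent_b, 5, 11)
print(child_a)  # 101110011111101
print(child_b)  # 100111110010000
```

Both children keep the length of their parents, and every gene outside the exchanged segment is inherited unchanged.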
2.5 Mutation
- Mutation can help the search find solutions that crossover alone might not encounter.
- Mutated chromosomes may be better or poorer than the old chromosomes. If they are poorer, they are eliminated in the selection step.
- The objective of mutation is to restore lost information and to explore the variety of the data.
- Example (the gene at position 10 is flipped):
- Before: 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1
- Result: 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1
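A bit-flip mutation like the one in the example can be sketched as follows. The helper names `flip_bit` and `mutate` and the per-gene mutation `rate` are assumptions for illustration; the slide does not state how the mutated position is chosen.

```python
import random


def flip_bit(chromosome, index):
    """Return a copy of the bit string with the gene at `index` flipped."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]


def mutate(chromosome, rate, rng=random):
    """Flip each gene independently with probability `rate`."""
    return "".join(
        ("1" if gene == "0" else "0") if rng.random() < rate else gene
        for gene in chromosome
    )


before = "101111110011101"
after = flip_bit(before, 9)  # position 10 on the slide, index 9 in Python
print(after)  # 101111110111101

# Random variant with a small, arbitrarily chosen rate.
print(mutate(before, rate=0.05, rng=random.Random(0)))
```

A low rate keeps most genes intact, so mutation perturbs the population without undoing what selection and crossover have already found.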
3. Suggested Steps of GA in an IR System
- Step 1: The user enters a query into the IR system.
- Step 2: Match the keywords from the user query with the list of keywords.
- Step 3: Encode the documents retrieved by the user query as chromosomes (the initial population).
- Step 4: Feed the population into the genetic operator process: selection, crossover, and mutation.
- Step 5: Repeat step 4 until the maximum generation is reached. This yields an optimized query chromosome for document retrieval.
- Step 6: Decode the optimized query chromosome into a query and retrieve documents from the database.
[Figure: flowchart of the suggested process. Enter user query; match query keywords with text keywords; if yes, encode the retrieved documents to chromosomes (generate the initial population); feed the population into the GA process (selection, crossover, and mutation); is the max generation reached? If no, repeat the GA process; if yes, decode the optimized query chromosome and retrieve documents from the database.]
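The process above, from the initial population through the generation loop, can be sketched end to end. This is a minimal illustration under stated assumptions, not the implementation behind the slides: fitness is assumed to be the mean Jaccard similarity to the retrieved documents, selection is fitness-proportional, and `run_ga`, `crossover_rate`, `mutation_rate`, and `seed` are hypothetical names and settings.

```python
import random


def jaccard(x, y):
    """Jaccard coefficient of two binary vectors."""
    both = sum(a & b for a, b in zip(x, y))
    either = sum(a | b for a, b in zip(x, y))
    return both / either if either else 0.0


def fitness(chromosome, documents):
    """Assumed fitness: mean Jaccard similarity to all retrieved documents."""
    return sum(jaccard(chromosome, d) for d in documents) / len(documents)


def run_ga(documents, max_generations=50, crossover_rate=0.8,
           mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    population = [list(d) for d in documents]  # encoded initial population
    size, length = len(population), len(population[0])
    for _ in range(max_generations):           # repeat until max generation
        scores = [fitness(c, documents) for c in population]
        # Selection: fitness-proportional; the epsilon avoids all-zero weights.
        parents = rng.choices(population,
                              weights=[s + 1e-9 for s in scores], k=size)
        # Crossover: two-point exchange between consecutive parents.
        offspring = []
        for i in range(0, size, 2):
            a, b = list(parents[i]), list(parents[(i + 1) % size])
            if rng.random() < crossover_rate:
                lo, hi = sorted(rng.sample(range(length + 1), 2))
                a[lo:hi], b[lo:hi] = b[lo:hi], a[lo:hi]
            offspring.extend([a, b])
        # Mutation: flip each gene with a small probability.
        population = [[1 - g if rng.random() < mutation_rate else g for g in c]
                      for c in offspring[:size]]
    # The fittest chromosome is the optimized query chromosome to decode.
    return max(population, key=lambda c: fitness(c, documents))


documents = [
    [0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
]
best = run_ga(documents)
print(best, round(fitness(best, documents), 4))
```

Fixing the seed makes a run repeatable, which helps when comparing parameter settings; the rates and generation count here are placeholders, not recommended values.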
4. A GA Example
- Input Documents and Keywords
- DOC0: DATA RETRIEVAL, DATABASE, COMPUTER NETWORKS, IMPROVEMENTS, INFORMATION RETRIEVAL, METHOD, NETWORK, MULTIPLE, QUERY, RELATION, RELATIONAL, RETRIEVAL, QUERIES, RELATIONAL DATABASES, RELATIONAL DATABASE, US, CARAT.DAT, GQP.DAT, ORUS.DAT, QUERY.OPT
- DOC1: INFORMATION, INFORMATION RETRIEVAL, INFORMATION STORAGE, INDEXING, RETRIEVAL, STORAGE, US, KEVIN.HOT
- DOC2: ARTIFICIAL INTELLIGENCE, INFORMATION RETRIEVAL SYSTEMS, INFORMATION RETRIEVAL, INDEXING, NATURAL LANGUAGE PROCESSING, US, DBMS.AI, GQP.DAT
- DOC3: FUZZY SET THEORY, INFORMATION RETRIEVAL SYSTEMS, INDEXING, PERFORMANCE, RETRIEVAL SYSTEMS, RETRIEVAL, QUERIES, US, KEVIN.HOT
- DOC4: INFORMATION RETRIEVAL SYSTEMS, INDEXING, RETRIEVAL, STAIRS, US, KEVIN.HOT
4. A GA Example (cont.)
- Total Set of Concepts
- DATA RETRIEVAL, DATABASE, COMPUTER NETWORKS, IMPROVEMENTS, INFORMATION RETRIEVAL, METHOD, NETWORK, MULTIPLE, QUERY, RELATION, RELATIONAL, RETRIEVAL, QUERIES, RELATIONAL DATABASES, RELATIONAL DATABASE, US, CARAT.DAT, GQP.DAT, ORUS.DAT, QUERY.OPT, INFORMATION, INFORMATION STORAGE, INDEXING, STORAGE, KEVIN.HOT, ARTIFICIAL INTELLIGENCE, INFORMATION RETRIEVAL SYSTEMS, NATURAL LANGUAGE PROCESSING, DBMS.AI, FUZZY SET THEORY, PERFORMANCE, RETRIEVAL SYSTEMS, STAIRS
4. A GA Example (cont.)
- Initial Genetic Pattern of Chromosomes in Population
- chromosome                          fitness
- 111111111111111111110000000000000   0.287744
- 000010000001000100001111100000000   0.411692
- 000010000000000101000010011110000   0.367556
- 000000000001100100000010101001110   0.427473
- 000000000001000100000010101000001   0.451212
- Average Fitness: 0.3891
4. A GA Example (cont.)
- A document that included more concepts shared by other documents had a higher Jaccard's score.
- Jaccard's Score of DOC0 and DOC0: 1.000000
- Jaccard's Score of DOC0 and DOC1: 0.120000
- Jaccard's Score of DOC0 and DOC2: 0.120000
- Jaccard's Score of DOC0 and DOC3: 0.115384
- Jaccard's Score of DOC0 and DOC4: 0.083333
- Average Fitness (Jaccard's Score) of Document0: 0.28774
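The scores on this slide follow from treating each document as a set of keywords and averaging its Jaccard score against every document in the set, itself included. A minimal sketch, with `jaccard` and `average_fitness` as illustrative names and the keyword sets copied from the input documents of this example:

```python
docs = [
    {"DATA RETRIEVAL", "DATABASE", "COMPUTER NETWORKS", "IMPROVEMENTS",
     "INFORMATION RETRIEVAL", "METHOD", "NETWORK", "MULTIPLE", "QUERY",
     "RELATION", "RELATIONAL", "RETRIEVAL", "QUERIES",
     "RELATIONAL DATABASES", "RELATIONAL DATABASE", "US", "CARAT.DAT",
     "GQP.DAT", "ORUS.DAT", "QUERY.OPT"},
    {"INFORMATION", "INFORMATION RETRIEVAL", "INFORMATION STORAGE",
     "INDEXING", "RETRIEVAL", "STORAGE", "US", "KEVIN.HOT"},
    {"ARTIFICIAL INTELLIGENCE", "INFORMATION RETRIEVAL SYSTEMS",
     "INFORMATION RETRIEVAL", "INDEXING", "NATURAL LANGUAGE PROCESSING",
     "US", "DBMS.AI", "GQP.DAT"},
    {"FUZZY SET THEORY", "INFORMATION RETRIEVAL SYSTEMS", "INDEXING",
     "PERFORMANCE", "RETRIEVAL SYSTEMS", "RETRIEVAL", "QUERIES", "US",
     "KEVIN.HOT"},
    {"INFORMATION RETRIEVAL SYSTEMS", "INDEXING", "RETRIEVAL", "STAIRS",
     "US", "KEVIN.HOT"},
]


def jaccard(a, b):
    """Shared keywords divided by distinct keywords across both sets."""
    return len(a & b) / len(a | b)


def average_fitness(doc, docs):
    """Mean Jaccard score of `doc` against every document, itself included."""
    return sum(jaccard(doc, other) for other in docs) / len(docs)


fitnesses = [average_fitness(d, docs) for d in docs]
for i, f in enumerate(fitnesses):
    print(f"DOC{i}: {f:.6f}")
print(f"Average Fitness: {sum(fitnesses) / len(fitnesses):.4f}")
```

For instance, DOC0 and DOC1 share 3 of 25 distinct keywords, which gives the 0.12 listed above. The five averages agree with the fitness values reported for the initial population.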
4. A GA Example (cont.)
- If a user provided documents that were closely related, the average fitness for the complete document set was high. If the user-selected documents were only loosely related, their overall fitness was low.
- Generally, GAs did a good job of optimizing a document set that was initially low in fitness. In the previous example, the overall Jaccard's score increased over generations. The optimized population contained only a single chromosome, with an average fitness value of 0.45121.
4. A GA Example (cont.)
- The optimized chromosome contained six relevant keywords that best described the initial set of documents.
- Using these "optimized" keywords, an information retrieval system could proceed to suggest relevant documents to users. The user-GA interaction continued until a search was completed or the user decided to stop.
4. A GA Example (cont.)
- Optimized Chromosomes in the Population
- chromosome                          fitness
- 000000000001000100000010101000001   0.45121
- 000000000001000100000010101000001   0.45121
- 000000000001000100000010101000001   0.45121
- 000000000001000100000010101000001   0.45121
- 000000000001000100000010101000001   0.45121
- Average Fitness: 0.4512
- Derived Concepts from Optimized Population
- RETRIEVAL, US, INDEXING, KEVIN.HOT, INFORMATION RETRIEVAL SYSTEMS, STAIRS
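Decoding the optimized chromosome back into concepts is a position lookup against the total set of concepts. A minimal sketch, assuming the gene order matches the order in which the concepts are listed in this example; `CONCEPTS` and `decode` are illustrative names.

```python
# The 33 concepts, in the order listed under "Total Set of Concepts".
CONCEPTS = [
    "DATA RETRIEVAL", "DATABASE", "COMPUTER NETWORKS", "IMPROVEMENTS",
    "INFORMATION RETRIEVAL", "METHOD", "NETWORK", "MULTIPLE", "QUERY",
    "RELATION", "RELATIONAL", "RETRIEVAL", "QUERIES",
    "RELATIONAL DATABASES", "RELATIONAL DATABASE", "US", "CARAT.DAT",
    "GQP.DAT", "ORUS.DAT", "QUERY.OPT", "INFORMATION",
    "INFORMATION STORAGE", "INDEXING", "STORAGE", "KEVIN.HOT",
    "ARTIFICIAL INTELLIGENCE", "INFORMATION RETRIEVAL SYSTEMS",
    "NATURAL LANGUAGE PROCESSING", "DBMS.AI", "FUZZY SET THEORY",
    "PERFORMANCE", "RETRIEVAL SYSTEMS", "STAIRS",
]


def decode(chromosome, concepts):
    """Return the concepts whose gene is set to 1 in the bit string."""
    return [c for bit, c in zip(chromosome, concepts) if bit == "1"]


optimized = "000000000001000100000010101000001"
derived = decode(optimized, CONCEPTS)
print(derived)
```

The genes set to 1 sit at positions 12, 16, 23, 25, 27, and 33, which map to the six derived concepts listed above.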
5. Conclusion
- The GA can be successfully applied in the field of information retrieval, and many approaches to doing so are possible.
- Continued study is required in this field, and in the future the algorithm should be tested on a large database.
6. References
- Ricardo Baeza-Yates, Modern Information Retrieval.
- Hsinchun Chen, Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms, Journal of the American Society for Information Science, 1994, in press. http://ai.arizona.edu/papers/mlir93/mlir93.html
- Bangorn Klabbankoh, Ouen Pinngern Ph.D., Applied Genetic Algorithms in Information Retrieval. International Journal of the Computer, the Internet and Management. http://www.ijcim.th.org/past_editions/1999V07N3/02-drouen.pdf
- Robert Losee, Learning Syntactic Rules and Tags with Genetic Algorithms for Information Retrieval and Filtering: An Empirical Basis for Grammatical Rules, Information Processing & Management, 32(2), pp. 185-197, 1996. (published article) http://www.ils.unc.edu/losee/gene1.pdf
- Eric Krevice Prebys, The Genetic Algorithm in Computer Science. International Journal of the Computer, the Internet and Management. http://www-math.mit.edu/phase2/UJM/vol1/PREBYS-F.PDF
- Matthew. http://lancet.mit.edu/mbwall/presentations/IntroToGAs/P002.html
- D. Vrajitoru (1997), Genetic Algorithms in Information Retrieval. AIDRI97: Learning From Natural Principles to Artificial Methods, Genève, June 1997. http://www.cs.iusb.edu/danav/papers/AidriEng.pdf