Information Science

1
How does the scientific search engine
www.scirus.com work?

A presentation by
Alexander Malek
Rebecca Mohr
Mara Kahrens
Holger Klöpfel

2
Scirus

Operated by Reed Elsevier
The world's largest professional-information group, formed from the Reed newsprint mill in the UK and the Elsevier newspaper-publishing and printing family in the Netherlands
Uses the FAST search technology
Technology from FAST Search & Transfer
First deployed at AllTheWeb in 1999, with an initial index of 200 million sites
Companies that use the FAST technology: AT&T, eBay, BroadVision, FirstGov, Freeserve, IBM, InfoSpace, Reuters, T-Online, Terra Lycos, Tiscali

3
Scirus

Dedicated to science
Free search
Sophisticated search technology
The world's most comprehensive science-specific index
Combination of Web resources and article databases
Individual source selection: USPTO, ScienceDirect, preprints, NASA, etc.
Document type: conference proceedings, articles, patents, etc.
Search refinement
Sorting by relevance or date
Results can be sent by e-mail
Researchers and students easily and efficiently find the information they need
Scirus, a free Web search engine dedicated to
science, uses sophisticated search technology to
create the world's most comprehensive
science-specific index. The Scirus Index is
comprised of a unique combination of free Web
sources and article databases.
Scirus is the most comprehensive science-specific
Web search engine available on the
Internet. Driven by the latest search engine
technology, it enables researchers and
students searching for scientific information to
chart and pinpoint the information
they need, including peer-reviewed articles,
author home pages and university sites,
quickly and easily.

4
Scirus

Searching can be a long, complicated process; a speciality search engine makes it more productive
Focus on subject-specific data
Searches the deep Web
Eliminates irrelevant data
Continuous further development of the search technology
The vast amount of information available on the
Internet can make searching a long, complicated
process. Speciality search engines like Scirus
provide a more productive search by:
Focusing only on sites with subject-specific data.
Searching the deep Web.
Filtering out irrelevant data.
Scirus, like the Internet itself, is a
work-in-progress. We work in partnership with
FAST to keep pace with evolving technology so
that our users can tap into the vast pool of
scientific information available on the Internet.
Through an in-depth process of filtering
information, classification and ranking, Scirus is
able to pinpoint information more precisely than
any other science-specific search engine.

5
Speciality/vertical/topical search engine

Finding scientific information on the Web is fast and efficient with Scirus
Concentrates on websites with scientific content
Searches free information sources such as scientists' home pages and university sites
Searches the world's largest database of scientific, technical and medical journals
Pre-prints, peer-reviewed articles and patents
An intuitive (self-explanatory) user interface and an advanced search
Reads non-HTML files and thus gives access to PDF and PostScript documents, which are often invisible to other search engines
Speciality search engines, also called vertical
or topical Web search engines, focus
on specific subject areas. Elsevier has worked in
partnership with FAST to create
unique processes that ensure that Scirus is the
most comprehensive science-specific
index on the market today. Locating scientific
information on the Web is fast and
efficient with Scirus because it:
Focuses only on websites containing scientific content and indexes those sites in depth.
Searches the World Wide Web for free sources of information such as scientist home pages and university websites.
Searches the world's largest database of scientific, technical and medical journals.
Locates pre-prints, peer-reviewed articles and patents.

6
Pinpointing results: the inverted pyramid
Seed list
Focused crawling: only specific terms on specific sites are searched
Databases (partner sources)
User's search query
7
Seed List

The basis on which the Scirus crawlers search the Internet. It is a list of vetted scientific websites.
Several methods feed the seed list:
Elsevier regularly supplies a selection of its newly published sites
Sites are also entered manually
Suggestions for new entries also come from users and webmasters
If the URL alone shows that a site is scientific, it is added
New links on frequently visited sites that are already indexed are followed
The seed list is the basis on which Scirus crawls
the Internet. The Scirus seed list is
created by a number of methods, including:
Elsevier publishing units are periodically asked to supply a list of sites in their subject area.
Members of the Scirus Scientific, Library and Technical Advisory Boards provide input on an ongoing basis.
Webmasters and Scirus users regularly submit suggestions for new sites.
Easily identifiable URLs (such as www.newscientist.com) are added on a regular basis.

8
Focused Crawling

The robot (also called spider or crawler) reads the text on the websites whose URLs are on the seed list
It only follows links that are also on the seed list
Updates are made the same way
The process the robot follows:
A scheduler coordinates which sites have priority and which parts the webmaster allows to be crawled (robots.txt, a file at the site's root), and limits the number of requests a robot may send to a server.
A crawler farm, a set of independent machine nodes, crawls the Web.
The robot collects documents and sends them to the index, and stores a copy so that Scirus can show an excerpt containing the search term in the results (dynamic teaser/keyword in context)
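The scheduler's two duties named on this slide, honouring robots.txt and limiting requests per server, can be sketched roughly as follows. This is an illustration only, not Scirus or FAST code: the class PoliteScheduler, its method names and the default delay are assumptions; only Python's standard urllib.robotparser module is real.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Decides whether and when a URL may be fetched (hypothetical helper)."""

    def __init__(self, user_agent, default_delay=1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay   # assumed pause between requests, in seconds
        self.rules = {}                      # host -> parsed robots.txt
        self.next_slot = {}                  # host -> earliest time of the next request

    def add_rules(self, host, robots_txt):
        """Register robots.txt text that was already downloaded for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True if the webmaster's robots.txt permits crawling this URL."""
        parser = self.rules.get(urlparse(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def schedule(self, url, now):
        """Return the time at which the URL may be requested, or None if disallowed."""
        if not self.allowed(url):
            return None
        host = urlparse(url).netloc
        parser = self.rules.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        if delay is None:
            delay = self.default_delay
        start = max(now, self.next_slot.get(host, now))
        self.next_slot[host] = start + delay   # reserve the slot for this host
        return start
```

Passing the current time in as an argument keeps the sketch deterministic; a real crawler would read a clock and also decide which sites have priority.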
Scirus uses a robot (also known as a spider or
crawler) to read the text on the
sites found on the seed list.
Unlike general search engines, the Scirus robot
doesn't follow links unless those
domains are also on the seed list. This type of
focused crawling ensures that only
scientific content is indexed. For instance, if
Scirus crawls www.newscientist.com
it will only read pages that fall under that
domain. It doesn't crawl a link to
www.google.com because that URL isn't on the seed
list.
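A minimal sketch of the link filter that focused crawling implies, assuming the seed list is held as a set of host names. The function name and the second seed entry are hypothetical; only www.newscientist.com and www.google.com come from the notes.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical seed list; only www.newscientist.com is taken from the notes.
SEED_DOMAINS = {"www.newscientist.com", "www.example.edu"}


def links_to_follow(page_url, hrefs, seed_domains=SEED_DOMAINS):
    """Keep only the links whose host is on the seed list."""
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)        # resolve relative links
        host = urlparse(absolute).netloc.lower()
        if host in seed_domains:                  # off-list hosts are never crawled
            kept.append(absolute)
    return kept
```

Called on a New Scientist page that links to /article/1 and to http://www.google.com/, the function keeps the first link and drops the second.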
The Scirus robot crawls the Web to find new
documents and update existing

9
Sources

Scirus contains more than 138 million pages
Scientific Web sources: 120 million freely accessible Web pages
Scientific journals and preprint sources: 18 million articles

10
Databases

While the robot works through the seed list, Scirus loads data from science-specific sources
Partner sources: ScienceDirect, MEDLINE on BioMedNet, Beilstein on ChemWeb, BioMed Central, US Patent Office
Open Archives Initiative (OAI)
The OAI facilitates the dissemination of content
Sources: arXiv.org, NASA (incl. NACA and LTRS), CogPrints, The Chemistry Preprint Server, The Computer Science Preprint Server, The Mathematics Preprint Server
While the robot crawls the seed list, Scirus
loads data from science-specific sources.
The loaded data consists of both partnership and
Open Archives Initiative (OAI)
sources. Partnership sources currently include
ScienceDirect, MEDLINE on BioMedNet,
Beilstein on ChemWeb, BioMed Central, and the US
Patent Office.
The OAI develops and promotes interoperability
standards to facilitate the efficient
dissemination of content. OAI sources currently
include arXiv.org, NASA (incl. NACA and LTRS),
CogPrints, The Chemistry Preprint Server, The
Computer Science Preprint Server and The
Mathematics Preprint Server.
The Scirus database will continue to expand to
ensure that as many science-specific
Web sources as possible are included in the index.

11
Source shares

Partner sources
Beilstein: 650,000 abstracts
BioMed Central: 1,250 full-text articles
MEDLINE: 13 million citations
ScienceDirect: 3 million full-text articles
USPTO: 950,000 patents
OAI sources
E-print arXiv (Los Alamos): 320,000 preprints
Math Preprints: 566
Chem Preprints: 563
CompSci Preprints: 280
CogPrints: 1,425
NASA: 11,000

12
Classification

Classification makes searching in Scirus targeted
Documents are assigned to subject areas, currently 20, which cover all fields of science
Each document is marked as a home page or an article, so that no unwanted pages are opened
The robot gathers all the pages and puts them
into a working index. Scirus reads
every word that appears on the site and examines
where the word appears on the
site (title, URL, text). Once the seed list has
been crawled and the database has been
loaded, Scirus is ready to classify the data.
The classification process improves the retrieval
of science-specific pages and allows the user to
perform searches that are
targeted towards specific scientific domains or
document types.
Scirus performs document classification following
two different schemes:
The subject classification identifies scientific domain descriptors that can be assigned to a document. Currently there are 20 subject areas available for selection, such as Medicine, Physics or Sociology, covering all major fields of science.
The information type classification assigns a document type such as scientist's homepage or scientific article. This narrows the search to specific kinds of documents and prevents the retrieval of unwanted pages.

13
Subject Classification

Classification by subject area
Each subject area has dictionaries of technical terms against which pages are matched
Tested against pre-classified corpora, these dictionaries can determine the statistical weighting of the terms
The dictionaries are used not only for classification but also to supply the most important keywords of a document
Meta-information, such as the URL and the anchor text around the link that pointed to the site, refines the classification
The algorithm allows a document to be assigned to several categories, since disciplines can overlap
Subject Classification
Scirus maintains a customized linguistic
knowledge base for each subject area. The
vocabulary on the pages is mapped against the
terms from dictionaries which have
been compiled through training on a very large,
manually pre-classified corpus of
scientific texts. The dictionaries are
supplemented with entries from domain-specific
terminological databases.
The vector terms are weighted single word and
multiword expressions. The weight
of the classification terms is determined by
examining statistical properties from the
training corpus such as the classification
strength and through partial manual
maintenance.
In addition to their classification task, the
dictionaries are also used to determine and
normalise terms for providing the main keywords
relevant for the document.
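The dictionary-based, multi-label idea can be illustrated with a small sketch. The dictionaries, weights, the "Biology" label and the threshold below are all invented for the example; the real Scirus dictionaries and weighting scheme are not given in the source.

```python
import re

# Hypothetical per-subject dictionaries: term -> weight.
SUBJECT_TERMS = {
    "Medicine": {"clinical trial": 2.0, "patient": 1.0, "gene": 0.5},
    "Physics": {"quantum": 2.0, "particle": 1.5},
    "Biology": {"gene": 1.5, "protein": 1.5},
}


def classify_subjects(text, subject_terms=SUBJECT_TERMS, threshold=2.0):
    """Return every subject whose summed term weights reach the threshold."""
    lowered = " ".join(text.lower().split())   # normalise case and whitespace
    scores = {}
    for subject, terms in subject_terms.items():
        score = 0.0
        for term, weight in terms.items():
            # Whole-word match; works for single words and multiword expressions.
            hits = re.findall(r"\b" + re.escape(term) + r"\b", lowered)
            score += weight * len(hits)
        if score >= threshold:                 # a document may pass for several subjects
            scores[subject] = score
    return scores
```

A text that mentions a gene, a protein and a clinical trial would be assigned to both Medicine and Biology here, which mirrors the point that disciplines can overlap.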

14
Information Type Classification

Information type classification
Custom software analyses the page profile and classifies the information type (e.g. abstracts, full-text articles, home pages)
The structure and vocabulary are analysed to assign a category
Name, address information, biographical data and publication lists help with categorisation
Information Type Classification
Scirus uses custom software to analyze the
profile of a page and classify the information
type. Types that are recognized include
scientific abstracts, full text scientific articles,
scientists' home pages, conference announcements
and other page types that are
relevant to the scientific domain. The
classification algorithm analyses the structure and
the vocabulary of a page to assign one of the
categories. For instance, scientists' home
pages can be recognised by looking at structural
information such as the presence of
address information, biographical data layout,
publication lists and by the presence of
keywords like "homepage", "publication list" etc.
The structural analysis also allows the
extraction of certain information chunks from the
analysed pages. In the case of a scientist's
homepage the module will attempt to extract
information like the name and affiliation of the
owner of the page and add it to the
document attributes.
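As a rough illustration of recognising a scientist's home page from vocabulary cues, the sketch below counts how many kinds of cue appear on a page. The cue patterns and the minimum of three cues are assumptions made for the example; the actual Scirus software also analyses page structure, which is not modelled here.

```python
import re

# Hypothetical cue patterns for a scientist's home page.
HOMEPAGE_CUES = {
    "keyword": re.compile(r"\b(home ?page|publication list|publications)\b", re.I),
    "address": re.compile(r"\b(department|university|institute)\b", re.I),
    "contact": re.compile(r"\b(e-?mail|phone|fax)\b", re.I),
    "biography": re.compile(r"\b(curriculum vitae|cv|biography)\b", re.I),
}


def homepage_cues(text):
    """Return the names of the cue groups found on the page, sorted."""
    return sorted(name for name, pattern in HOMEPAGE_CUES.items() if pattern.search(text))


def looks_like_homepage(text, min_cues=3):
    """Classify as a home page when enough independent cues are present."""
    return len(homepage_cues(text)) >= min_cues
```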

15
Growth
The seed list and the number of indexed
sites have grown steadily since Scirus was first
launched in April 2001
16
Search query

Intelligent query rewrites improve the ranking, if the user so chooses, by automatically reformulating the search query.
Phrases are taken from the phrase dictionary and non-essential words are removed (e.g. "what is", "where can I find information about")
Scirus improves the ranking and relevance of
results by implementing intelligent
query rewrites which are designed to
automatically understand the intention of the
user and enable more intelligent searching by
rewriting the queries.
Query transformations performed on the searcher's
behalf include the addition of quotes
around common phrases that are detected from the
Scirus phrase dictionary and the
removal of non-essential search words in the
query such as "what is" and
"where can I find information about". Searchers
have the option of running a query
without the re-writing function.
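The two transformations described here, quoting known phrases and dropping non-essential words, can be sketched as below. The phrase list stands in for the Scirus phrase dictionary and is invented for the example, as are the function and parameter names; the two filler expressions are the ones quoted in the notes.

```python
import re

# Hypothetical stand-ins for the Scirus phrase dictionary and filler list.
PHRASES = ["genetic manipulation", "climate change"]
FILLERS = ["where can i find information about", "what is"]


def rewrite_query(query, phrases=PHRASES, fillers=FILLERS, rewrite=True):
    """Strip filler wording and quote known phrases.

    With rewrite=False the query is returned unchanged, mirroring the
    searcher's option to switch rewriting off.
    """
    if not rewrite:
        return query
    result = query.strip()
    for filler in fillers:
        result = re.sub(r"\b" + re.escape(filler) + r"\b", " ", result, flags=re.I)
    for phrase in phrases:
        # The lookarounds skip phrases the user has already put in quotes.
        pattern = r'(?<!")\b' + re.escape(phrase) + r'\b(?!")'
        result = re.sub(pattern, lambda m: '"' + m.group(0) + '"', result, flags=re.I)
    # Collapse leftover whitespace and drop a trailing question mark.
    return " ".join(result.split()).rstrip("?").strip()
```

For instance, the query What is genetic manipulation? would be rewritten to the quoted phrase "genetic manipulation".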

17
Intelligent rewrites screenshot
18
Search

Basic Search
Boolean operators and field names
Exact phrase search
All sources or a restricted set of sources
Refine Search
Relevant classification terms from the documents found can be used to refine the search
Advanced Search
For customising the search
Search within the 20 subject areas
Search within a specific data set
Information type (e.g. abstracts, conferences, patents, websites)
Search specific sources (e.g. journals from BioMed Central, NASA websites)
Search by journal title or author name
Sort results from partner databases by date
Scirus has a wide range of features to help users
pinpoint the information they're
looking for.
Basic Search
The basic search function enables users to
specify

19
Advanced search screenshot
20
Static Ranking: Word Analysis

Ranking is based on two values: words and links
Frequency and position of the words
in the title
in the text
in the link
To ensure that full-text articles are not automatically weighted higher than abstracts, Scirus counts the number of keywords and divides it by the total number of words in the text
Short URLs count for more than longer ones
The closer together the query keywords are, the higher the ranking
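The length normalisation on this slide (keyword count divided by total word count) can be written down directly. This sketch covers only that one step; the tokenisation rule and the function name are assumptions, and position, URL length and proximity are left out.

```python
import re


def term_density(text, query_terms):
    """Keyword occurrences divided by the total number of words in the text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    wanted = {term.lower() for term in query_terms}
    hits = sum(1 for word in words if word in wanted)
    return hits / len(words)
```

A three-word abstract containing two query terms scores about 0.67, while a ten-word text with the same two hits scores 0.2, so the longer document gains nothing from its length alone.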
Scirus uses an algorithm to rank the documents
resulting from a query. Algorithms
are procedures, or formulas, used to solve a
problem. Ranking is based on two basic
values: terms and links.
Term Frequency
For term value, the location and frequency of
occurrence of the terms within the
document are measured. The global frequency of
the term within the whole index
is also taken into consideration. Scirus asks the
following questions when looking at
term location and frequency

21
Dynamic Ranking: Link Analysis

The link structure and the anchor texts around the links are exploited for ranking
The more links lead to a site, the higher the site ranks
Databases are excluded because they are not crawled
The result with the highest rank value is listed first
Every document that matches a search query receives a score
Static ranking is assigned on the basis of analysing the document alone
Dynamic ranking, by contrast, is based on the position of the query keywords within a document
Dynamic and static ranking each make up 50% of the ranking
Link Analysis
Scirus uses link analysis as part of its
relevancy ranking system. For link value, the
number of links to a page is analysed. The
cardinality or importance of a page is
determined by calculating the number of links to
a page. The more links to the page,
the higher the ranking. Scirus also analyses the
anchor text (the text of a link or
hyperlink) to determine the relevance of a site.
Because pages in the database load aren't
crawled, it isn't possible to conduct a link
analysis. These pages are assigned a static
score. Every time a new Scirus Index is
created the static score is examined for
relevance.
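A minimal sketch of the two ideas on this slide: counting inbound links per page, and combining a static and a dynamic score with equal weight. The function names are hypothetical, and the assumption that both scores are already scaled to the range 0 to 1 is mine; the source gives only the 50/50 split.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count distinct pages linking to each target.

    links is an iterable of (source, target) pairs; duplicate pairs and
    self-links are ignored.
    """
    return Counter(target for source, target in set(links) if source != target)


def combined_score(static_score, dynamic_score):
    """Equal-weight combination, assuming both scores are scaled to 0..1."""
    return 0.5 * static_score + 0.5 * dynamic_score
```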
Scirus uses a special general terms dictionary
with select scientific terms to identify

22
Search results

Results can be displayed in several ways
A domain is listed only once even if it has several hits, preferably with its shortest URL. "More hits from" leads to the remaining hits.
Dynamic teasers (relevant text) with the search terms highlighted
Sources shown as a URL or as the partner's icon
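Two of the display features on this slide, collapsing hits per domain and building a dynamic teaser, can be sketched as follows. Both functions are illustrations under assumptions: the names, the snippet width and the asterisk used for highlighting are invented, and the real Scirus output format is not shown in the source.

```python
import re
from urllib.parse import urlparse


def collapse_by_domain(ranked_urls):
    """Group ranked hits by host: show the shortest URL, keep the rest as 'more hits'."""
    groups = {}                       # host -> URLs in ranking order
    for url in ranked_urls:
        groups.setdefault(urlparse(url).netloc, []).append(url)
    collapsed = []
    for urls in groups.values():
        shown = min(urls, key=len)    # prefer the shortest URL of the domain
        more = [u for u in urls if u != shown]
        collapsed.append((shown, more))
    return collapsed


def dynamic_teaser(text, terms, width=40, mark="*"):
    """Return a snippet around the first matching term, with matches marked."""
    pattern = re.compile("|".join(r"\b" + re.escape(t) + r"\b" for t in terms), re.I)
    match = pattern.search(text)
    if match is None:
        return text[:2 * width]       # no hit: fall back to the start of the text
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    highlighted = pattern.sub(lambda m: mark + m.group(0) + mark, text[start:end])
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + highlighted + suffix
```

The collapsed list keeps the order in which each domain first appears in the ranking, so the best-ranked domain still comes first.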
To ensure that results analysis is an efficient
process, Scirus presents results in a
number of innovative ways:
It collapses the site to prevent returning multiple pages of the same Website. Although the content is different, pages from the same domain often look alike. If the user clicks "more hits from" at the end of the citation, Scirus will display more matching results from the same Website.
Dynamic teasers (also called relevant text) return the part of the result relevant to the query and highlight the search terms. For example, if you search for "genetic manipulation" the first result has the following teaser
Sources are branded so that it is clear whether the results are from the Web or loaded databases. For example, BioMed Central results are displayed as follows