Title: Organizing current awareness in a large volunteer-based digital library
1Organizing current awareness in a large
volunteer-based digital library
- Thomas Krichel
- 2006-02-27
2outline
- Background to work that we did
- RePEc (Research Papers in Economics)
- NEP New Economics Papers
- The research
- Theory
- Method
- Results
- Other work done for NEP.
3This talk has three parts
- Some background
- Two papers
- chablis paper, with Nisa Bakkalbasi (Yale)
- http://openlib.org/home/krichel/papers/chablis.pdf
- shibuya paper
- http://openlib.org/home/krichel/shibuya.pdf
4RePEc
- Digital library for academic economics. It
collects descriptions of
- economics documents (working papers, articles,
etc.)
- collections of those documents
- economists
- collections of economists
5RePEc principle
- Many archives
- Archives offer metadata about digital objects, or
data about authors and institutions.
- One database
- Many services
- Users can access the data through many
interfaces.
- Providers of archives offer their data to all
interfaces at the same time. This provides for an
optimal distribution.
6it's the incentives, stupid
- RePEc applies the ideas of open source to the
construction of a bibliographic dataset. It
provides an open library.
- The entire system is constructed in such a way as
to be sustainable without monetary exchange
between participants.
7some history
- In the early 1990s, Thomas Krichel dreamed about
a current awareness service for working papers.
It would later have electronic papers.
- In 1993 he made the first economics working paper
available online.
- In 1997 he wrote the key protocols that govern
RePEc.
8RePEc is based on 550 archives
- WoPEc
- EconWPA
- DEGREE
- S-WoPEc
- NBER
- CEPR
- Elsevier
- US Fed in Print
- IMF
- OECD
- MIT
- University of Surrey
- CO PAH
- Blackwell
9to form a 362k item dataset
- 171,000 working papers
- 187,000 journal articles
- 1,300 software components
- 2,100 book and chapter listings
- 9,000 author contact publication listings
- 9,300 institutional contact listings
- more records than arXiv.org
10RePEc is used in many services
- EconPapers
- NEP New Economics Papers
- Inomics
- RePEc author service
- Z39.50 service by the DEGREE partners
- IDEAS
- RuPEc
- EDIRC
- LogEc
- CitEc
11NEP New Economics Papers
- This is a set of current awareness reports on new
additions to the working paper stock only.
Journal articles would be too old.
- Founded by Thomas Krichel in 1998.
- Supported by the Economics department at WUStL.
- Initial software was written by Jose Manuel
Barrueco Cruz.
- First general editor was John S. Irons.
12why NEP
- Public aim: Current awareness, if well done, can
be an important service in its own right. It is
sheltered from the competition of general search
engines.
- Private aim: It is useful to have some
classification information, even if limited,
- for performance measures
- for general research purposes
13modus operandi stage 1
- The general editor uses a computer program that
gathers all the new additions to the working
paper stock. This is usually done weekly.
- S/he filters out new descriptions of old papers,
using
- the date field
- handle heuristics
- The result is an issue of the nep-all report.
14modus operandi stage 2
- Editors consider the papers in the nep-all report
to filter out papers that belong to the subject.
This forms an issue of a subject report nep-???.
- nep-all and the subject reports are circulated
via email.
- A special arrangement makes the data of NEP
available to other RePEc services.
15some numbers
- There are now 60 NEP lists.
- Over 39k subscriptions.
- Over 16k subscribers.
- Over 50k papers announced.
- Over 100k announcements.
- Homepage at http://nep.repec.org
- All this is a fantastic success!!
16problem with the private aim
- We would need all the papers to be classified,
not only the working papers.
- We would need to have 100% coverage of NEP.
- This means every paper in nep-all appears in at
least one subject report.
17coverage ratio
- We call the coverage ratio the share of papers
in nep-all that have been announced in at least
one subject report.
- We can define this ratio
- for each nep-all issue
- for a subset of nep-all issues
- for NEP as a whole
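A minimal sketch of the coverage ratio for one nep-all issue, assuming papers are identified by handles and each subject report issue is a set of handles. The function name and the sample handles are hypothetical, chosen only for illustration.

```python
def coverage_ratio(nep_all_issue, subject_reports):
    """Share of papers in a nep-all issue announced in at least one subject report."""
    if not nep_all_issue:
        raise ValueError("nep-all issue is empty")
    # Union of all papers announced in any subject report.
    announced = set().union(*subject_reports)
    covered = [paper for paper in nep_all_issue if paper in announced]
    return len(covered) / len(nep_all_issue)


# Hypothetical handles and report contents.
issue = ["p1", "p2", "p3", "p4"]
reports = [{"p1", "p2"}, {"p2", "p4"}]
ratio = coverage_ratio(issue, reports)  # 3 of the 4 papers are covered
```

The same function applies to a subset of issues or to NEP as a whole by pooling the papers of the issues concerned.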
18coverage ratio theory evidence
- Over time more and more NEP reports have been
added. As this happens, we expect the coverage
ratio to increase.
- However, the evidence from research by Barrueco
Cruz, Krichel and Trinidad is
- The coverage ratio of different nep-all issues
varies a great deal.
- Overall, it remains at around 70%.
- We need some theory as to why. This is where the
chablis paper comes in.
19two theories
- Target-size theory
- Quality theory
- descriptive quality
- substantive quality
20theory 1 target size theory
- When editors compose a report issue, they have a
size of the issue in mind.
- If the nep-all issue is large, editors will take
a narrow interpretation of the report subject.
- If the nep-all issue is small, editors will take
a wide interpretation of the report subject.
21target size theory static coverage
- There are two things going on
- The opening of new subject reports improves the
coverage ratio.
- The expansion of RePEc implies that the size of
nep-all, though varying in the short run, grows
in the long run. Target size theory implies that
the coverage ratio deteriorates.
- The static coverage ratio that we observe is the
result of both effects canceling out.
22theory 2 quality theory
- George W. Bush version of quality theory
- Some papers are rubbish. They will not get
announced.
- The amount of rubbish in RePEc remains constant.
- This implies constant coverage.
- Reality is slightly more subtle.
23two versions of quality theory
- Descriptive quality theory: papers that are badly
described
- misleading titles
- no abstract
- languages other than English
- Substantive quality theory: papers that are well
described, but not good
- from unknown authors
- issued by institutions with unenviable research
reputation
24practical importance
- We do care whether one or the other theory is
true.
- Target size theory implies that NEP should open
more reports to achieve perfect coverage.
- Quality theory suggests that opening more reports
will have little to no impact on coverage.
- Since operating more reports is costly, there
should be an optimal number of reports.
25overall model
- We need an overall model that explains subject
editors' behavior.
- We can feed this model with variables that
represent theoretical determinants of behavior.
- We can then assess the strength of various
factors empirically.
26method
- The dependent variable is announced. It is 1 if
the paper has been announced, 0 otherwise.
- Since we are explaining a binary variable, we can
use binary logistic regression analysis (BLRA).
This is a fairly flexible technique, useful when
the probability distributions governing the
independent variables are not well known.
- That's why BLRA is popular in the life sciences.
27independent variables size
- size is the size of the nep-all issue in which
the paper appeared.
- This is the critical indicator of target size
theory. We expect it to have a negative impact on
announced.
28independent variables position
- position is the position of the paper in the
nep-all issue.
- The presence of this variable can be justified by
the combined assumption of target size and editor
myopia.
- If editors are myopic, they will be more liberal
at the start of nep-all than at the end of
nep-all.
29independent variables title
- title is the length of the title of the paper,
measured by the number of characters.
- This variable is motivated by descriptive quality
theory. A longer title will say more about the
paper than a short title. This makes it less
likely that a paper is overlooked.
30independent variables abstract
- abstract is the presence/absence of an abstract
to the paper.
- This is also motivated by descriptive quality
theory.
- Note that we do not use the length of the
abstract because that would be a highly skewed
variable.
31independent variables language
- language indicates whether the metadata is in
English or not.
- This variable is motivated by descriptive quality
theory and the idea that English is the most
commonly understood language.
- While there are a lot of multilingual editors,
customizing this variable would have been rather
hard.
32independent variables series
- series is the size of the series in which a paper
appears.
- This variable is motivated by substantive quality
theory.
- The larger a series is, the higher, usually, is
its reputation. We can roughly classify series by
size and quality
- multi-institution series (NBER, CEPR)
- large departments
- small departments
33independent variables author
- author is the prolificacy of the authors of the
paper.
- It is justified by substantive quality theory.
- This is the most difficult variable to measure.
We use the number of papers written by the
registered author with the most papers.
- Since about 50% of the papers have no registered
author, a lot of them are excluded. But the
exclusion should introduce no bias.
34create categorical variables
- size_1: [179, 326)
- size_2: [326, 835]
- title_1: [55, 77)
- title_2: [77, 1945]
- position_1: [0.357, 0.704)
- position_2: [0.704, 1.000]
- series_1: [98, 231)
- series_2: [231, 3654]
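The cut points above can be turned into pairs of dummy variables. A minimal sketch, assuming the first interval is half-open, the second is closed, and values below the first cut point form the baseline category. These bracket conventions and the function name are assumptions, not taken from the paper.

```python
def interval_dummies(value, lower, middle, upper):
    """Return (dummy_1, dummy_2) for the intervals [lower, middle) and [middle, upper]."""
    dummy_1 = 1 if lower <= value < middle else 0
    dummy_2 = 1 if middle <= value <= upper else 0
    return dummy_1, dummy_2


# Cut points for size as listed on the slide.
small_issue = interval_dummies(100, 179, 326, 835)   # baseline category
medium_issue = interval_dummies(200, 179, 326, 835)  # falls into size_1
large_issue = interval_dummies(400, 179, 326, 835)   # falls into size_2
```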
35results
- P(announced = 1 | x) = exp(g(x)) / (1 + exp(g(x)))
- g(x) = 0.2401 - 0.2774 size_1 - 0.4657 size_2 +
0.1512 title_1 + 0.2469 title_2 + 0.3874 abstract +
0.0001 author + 0.7667 language -
0.1159 series_1 + 0.1958 series_2
- position is not significant. author just makes
the cut.
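The logistic form maps the linear score g(x) to a probability of announcement. A minimal sketch, using the coefficients as listed on this slide. Terms without an explicit minus sign are assumed to enter with a plus sign, and the example paper is hypothetical.

```python
import math

# Coefficients as listed on the results slide. Terms without an explicit
# minus sign are ASSUMED to be positive.
INTERCEPT = 0.2401
COEFFICIENTS = {
    "size_1": -0.2774,
    "size_2": -0.4657,
    "title_1": 0.1512,
    "title_2": 0.2469,
    "abstract": 0.3874,
    "author": 0.0001,
    "language": 0.7667,
    "series_1": -0.1159,
    "series_2": 0.1958,
}


def announce_probability(features):
    """Return P(announced = 1 | x) for a dict of feature values (missing ones count as 0)."""
    g = INTERCEPT + sum(COEFFICIENTS[name] * value
                        for name, value in features.items())
    return math.exp(g) / (1.0 + math.exp(g))


# Hypothetical paper in the baseline categories, without and with an abstract.
without_abstract = announce_probability({"abstract": 0})
with_abstract = announce_probability({"abstract": 1})
```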
36odds ratio
- size_1: 1.32 (1.22, 1.44)
- size_2: 0.83 (0.76, 0.90)
- title_1: 1.16 (1.07, 1.26)
- title_2: 1.28 (1.18, 1.39)
- abstract: 1.47 (1.34, 1.62)
- language: 2.15 (1.85, 2.51)
- series_1: 1.11 (1.02, 1.20)
- series_2: 1.37 (1.26, 1.49)
- author: 1.05 (1.01, 1.09)
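In a logistic regression, the odds ratio of a variable is the exponential of its coefficient. A minimal sketch that checks this for the abstract and language coefficients given on the results slide (0.3874 and 0.7667). The function name is hypothetical.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)


abstract_or = round(odds_ratio(0.3874), 2)  # matches the 1.47 listed for abstract
language_or = round(odds_ratio(0.7667), 2)  # matches the 2.15 listed for language
```

An odds ratio above 1 means the variable raises the odds of a paper being announced.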
37scandal!
- Substantive quality theory cannot be rejected.
That means that the editors are selecting for
quality as well as for the subject.
- The editors have rejected our findings. Almost
all protest that there is no quality filtering.
- This is where the chablis paper ends.
38consequences
- There has been no program to expand the list.
- There has to be a concerted effort to help
editors find subject-specific papers.
- More effort needs to be made for editors to
really find the subject-specific papers. This can
be done by
- the use of a more efficient interface
- the use of automated resource discovery methods.
39ernad
- editing reports on new academic documents. It is
a purpose-built software system for current
awareness reports.
- It has been designed by Thomas Krichel,
http://openlib.org/home/krichel/work/altai.html.
The design is complicated, but the system is
quite easy to use.
- The system was written by Roman D. Shapiro.
40statistical learning
- The idea is that a computer may be able to make
decisions on the current nep-all reports based on
the observation of earlier editorial decisions.
- ernad now works using support vector machines
(SVM), with titles, abstracts, author names,
classification values and series as features.
41SVM performance
- If we use average search length, we can do
performance evaluations.
- It turns out that reports have very different
forecastability. Some are almost perfect, others
are weak.
- Again, this raises a few eyebrows!
42what is the value of an editor?
- If the forecast is perfect, we don't need the
editor.
- If the forecast is very weak, the editor may be a
prankster.
43pre-sorting reconceived
- We should not think of pre-sorting via SVM as
something to replace the editor.
- We should not think of it as encouraging editors
to be lazy.
- Instead, we should think of it as an invitation
to examine some papers more closely than others.
44headline vs. bottomline data
- The editors really have a three-stage decision
process.
- They read title, author names.
- They read the abstract.
- They read the full text.
- A lot of papers fail at the first hurdle.
- SVM can read the abstract and prioritize papers
for abstract reading.
- Editors are happy with the pre-sorting system.
45performance evaluation
- This is really where the shibuya paper starts.
- How should the success or failure of a sorting
algorithm be quantified?
- Classic information retrieval suggests precision
and recall.
46precision and recall
- precision is the number of retrieved and relevant
documents divided by the number of retrieved
documents.
- recall is the number of retrieved and relevant
documents divided by the number of relevant
documents.
- Both numbers are used together but recall is
often difficult to measure.
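The two definitions can be written down directly. A minimal sketch, assuming documents are identified by hashable ids. The function name and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"b", "d", "e"})
```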
47precision and recall problem
- Precision and recall really apply to "large" IR
problems, where the set of documents is too large
to be examined "by hand". Users only see the set
of retrieved papers.
- Here we have a "small" information retrieval
problem.
48PR interpretation 1
- We can argue that when we sort nep-all, recall is
always constant at 100%.
- Precision is the number of relevant papers in the
issue, divided by the size of nep-all. This does
not depend on the sorting process.
49PR interpretation 2
- We can look at the precision achieved at the last
retrieved paper. This is a measure that is
equivalent to one measure I will present later,
that essentially looks at how low the last paper
has fallen.
- But recall is still useless.
50PR interpretation 3
- We could reduce the vector coming out of the
sorting process to a set. We can then compare
- set of predicted useful documents
- set of actual used documents
- But this would mean deliberately throwing away
information.
- And under this criterion different orders, which
should widely differ for editors, can get the
same evaluation.
51we need some different theory!
- We will look at some simple theory of editor
behavior.
- This theory is a bit like an economic theory in
the sense that it has been made under
ridiculously simplifying assumptions.
- The hope is that the theory sheds light on
basic features of the problem that remain
operational under more realistic assumptions.
52key assumption 1 binary decision
- An editor faces a list of documents. Each
document describes a working paper that has been
added to RePEc recently. The editor examines the
document.
- An editor may spend a varying amount of effort
examining a document. This would be a very
complex decision to model. We assume it away.
- Thus we assume a document is examined or not.
53key assumption 2 no learning
- The decision whether a document is relevant or
not is assumed to only depend on the contents of
that document.
- It is assumed not to depend on the contents of
any other document.
- This assumption assumes away learning.
54introducing cost-based reasoning
- Editors face an optimal stopping problem.
- There are two types of costs that editors are
facing.
- the cost of examining a new paper, c_1. We can
safely assume that c_1 is constant.
- the cost associated with losing papers, c_2. It
will depend on the number of papers lost. If
c_2 > 0, it will be unknown.
55c_1 and c_2
- c_1 and c_2 seem to dictate editor behavior
- If c_1 >> c_2 the editor will not examine any
documents.
- If c_1 << c_2 the editor will examine all
documents.
- Let us assume that the editor is conscientious.
That is, c_1 and c_2 are such that, while there
is a chance that there are some more relevant
documents left, the editor will continue to
examine the list.
56the traffic light
- We still have a complicated problem. Only a
totally unrealistic assumption can save us.
- Basically, let us assume that there is no
uncertainty about c_2. This is the traffic light
assumption
- A traffic light shows green as long as there are
more relevant documents to be discovered.
- The traffic light shows red otherwise.
57conscientious editor traffic light
- Under the traffic light scenario the
conscientious editor will examine papers until
the light shows red.
- Therefore
- c_2 = 0
- examination cost is c_1 i, where i is the
position of the last relevant paper in x.
58what have we learned?
- When presented with a series of outcomes, the
editor will prefer the one where the last
position of a relevant document is lower.
- This defines a weak ordering over all outcomes.
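Under the traffic light assumption the editor stops right after the last relevant paper, so outcomes can be compared by that position alone. A minimal sketch, assuming an outcome is a list of 0/1 flags in sorted order. The function names are hypothetical.

```python
def last_relevant_position(outcome):
    """1-based position of the last relevant paper, or 0 if none is relevant."""
    positions = [i for i, flag in enumerate(outcome, start=1) if flag == 1]
    return positions[-1] if positions else 0


def weakly_preferred(outcome_a, outcome_b):
    """True if the editor likes outcome_a at least as much as outcome_b."""
    return last_relevant_position(outcome_a) <= last_relevant_position(outcome_b)


early = [1, 1, 0, 0, 0]  # editor stops after 2 papers
late = [1, 0, 0, 1, 0]   # editor stops after 4 papers
```

Outcomes with the same last relevant position tie, which is why the ordering is only weak.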
59relaxing the traffic light
- Assume that there could be some uncertainty about
the traffic light at the end of the examination
process.
- Assume that it is so small that the behavior of
the editor would be unchanged.
- Contrast
- ranking A: 10100
- ranking B: 01100
- Then A should be preferred over B.
60the natural order
- Repeating the previous argument, we can find a
full ordering over all outcomes that a rational
and conscientious editor will have.
- I am sure the optimality of that order could be
confirmed for more general scenarios.
- But that is a matter of conviction.
61notation
- We consider a nep-all report that has n papers.
- r of the papers are relevant.
- x is an outcome vector.
- x_i = 0 if the paper at position i is not relevant.
- x_i = 1 if the paper at position i is relevant.
62natural order when n=5, r=2
- 1 1 0 0 0    0 0 1 1 0
- 1 0 1 0 0    1 0 0 0 1
- 0 1 1 0 0    0 1 0 0 1
- 1 0 0 1 0    0 0 1 0 1
- 0 1 0 1 0    0 0 0 1 1
- (read column first)
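The listing above can be reproduced programmatically. A minimal sketch, assuming the natural order compares outcomes from the last position backwards, so that an outcome ranks earlier when its relevant papers sit nearer the top. That reading matches the n=5, r=2 listing. The function name is hypothetical.

```python
from itertools import combinations


def natural_order(n, r):
    """All outcome vectors with r relevant papers among n, best outcome first."""
    outcomes = []
    for ones in combinations(range(n), r):
        outcomes.append(tuple(1 if i in ones else 0 for i in range(n)))
    # Sorting on the reversed vector compares the last position first,
    # then the second to last, and so on.
    return sorted(outcomes, key=lambda x: x[::-1])


order = ["".join(str(flag) for flag in x) for x in natural_order(5, 2)]
# order starts with "11000", "10100", "01100" and ends with "00011"
```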
63measuring success
- Let f(x) be a measure of the goodness of an
outcome. It appears natural to require
- A: f(x) > f(x') if x is better than x'
- B: f(1,...,1,0,...,0) = 1
- C: E f(x) = 0, where E is the expected value
operator over the entire set of outcomes.
- D: respect for the natural order
- C calls for a closed form of the expected
value.
64Brookes Swets measures
- Brookes and Swets measure on z, the internal
ranking variable. They measure the true
discriminating value of z.
- It is difficult to build a measure by
transformation that satisfies B and C.
- It will not satisfy D.
65the average search length
- This is the average position of a relevant
document, divided by n.
- This can be transformed to satisfy B and C.
- The problem remains that it does not satisfy D.
- Using a simple change such as taking the
logarithm of the position does not help.
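A minimal sketch of why the average search length can conflict with requirement D, assuming the natural order ranks an outcome higher when its last relevant paper comes earlier. The function name is hypothetical. A lower value is better.

```python
def average_search_length(outcome):
    """Average 1-based position of the relevant papers, divided by n."""
    positions = [i for i, flag in enumerate(outcome, start=1) if flag == 1]
    if not positions:
        raise ValueError("outcome has no relevant paper")
    return sum(positions) / len(positions) / len(outcome)


# 00110 precedes 10001 in the natural order for n=5, r=2,
# yet the average search length scores 10001 as the better outcome.
asl_00110 = average_search_length([0, 0, 1, 1, 0])
asl_10001 = average_search_length([1, 0, 0, 0, 1])
```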
66Cooper's expected search length
- This (roughly) is the number of non-relevant
documents found until a target number of relevant
documents has been found.
- This can be transformed to satisfy B and C.
- It can weakly impose D. But all outcomes where
the same document is at the last position are
considered equivalent.
- This is a problem.
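A minimal sketch of the rough definition above, for a fully sorted list with no ties. This is a simplified reading rather than Cooper's full measure, and the function name is hypothetical.

```python
def search_length(outcome, target):
    """Non-relevant documents seen before the target-th relevant one is found."""
    found = 0
    skipped = 0
    for flag in outcome:
        if flag == 1:
            found += 1
            if found == target:
                return skipped
        else:
            skipped += 1
    raise ValueError("fewer relevant documents than target")


# With the target set to all relevant papers, only the position of the
# last relevant paper matters, so these two outcomes get the same value.
length_10100 = search_length([1, 0, 1, 0, 0], target=2)
length_01100 = search_length([0, 1, 1, 0, 0], target=2)
```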
67natural order implementation I
- One way is to use powers. Construct a penalty
- y^n x_n + y^(n-1) x_(n-1) + ... + y x_1 where y > 1.
- It is possible to find the expected value of this
expression and construct a measure that satisfies
B, C, and D.
- Exact values depend on y.
68natural order implementation II
- Another way is to count the items in the natural
order, starting at zero say.
- Finding the expected value is trivial in this
case.
- But we need an algorithm that quickly finds the
position of an outcome in the order. Such an
algorithm is described in the paper.
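The paper's own algorithm is not reproduced here. As an illustration only, one standard way to find the position of an outcome without enumerating the order is the combinatorial number system, assuming the natural order compares outcomes from the last position backwards. The rescaling in `goodness` is likewise an assumption, chosen so that the best outcome scores 1 and the average over all outcomes is 0.

```python
from math import comb


def natural_rank(outcome):
    """Zero-based position of an outcome in the assumed natural order."""
    positions = [i for i, flag in enumerate(outcome) if flag == 1]  # 0-based
    # The k-th relevant paper (k = 1, 2, ...) at 0-based position p
    # contributes C(p, k) to the rank.
    return sum(comb(p, k) for k, p in enumerate(positions, start=1))


def goodness(outcome):
    """Rescaled rank: 1 for the best outcome, 0 on average, -1 for the worst."""
    total = comb(len(outcome), sum(outcome))
    if total == 1:
        return 1.0  # only one possible outcome
    return 1.0 - 2.0 * natural_rank(outcome) / (total - 1)


best = natural_rank([1, 1, 0, 0, 0])   # first in the order
worst = natural_rank([0, 0, 0, 1, 1])  # last of the 10 outcomes
```

Because ranks run from 0 to C(n, r) - 1 with equal weight, the expected rank is simply (C(n, r) - 1) / 2.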
69test
- We extract author names, titles, abstracts,
series id, and classification codes.
- We do a straight feature count, then normalize
by the Euclidean norm.
- We set aside 300 observations for testing and use
the rest for learning.
- We use SVM_light. We conduct 100 tests per
report.
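A minimal sketch of a straight feature count followed by Euclidean normalization. The tokenization and the sample tokens are hypothetical, and no SVM_light call is shown.

```python
import math
from collections import Counter


def feature_vector(tokens):
    """Count each feature, then scale the counts to unit Euclidean length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(count * count for count in counts.values()))
    if norm == 0:
        return {}
    return {feature: count / norm for feature, count in counts.items()}


# Hypothetical tokens drawn from a title and an abstract.
tokens = ["growth", "growth", "trade", "wages", "wages"]
vector = feature_vector(tokens)  # counts 2, 1, 2 with norm 3
```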
70results
- Cooper's measure does worse than the linear
measures such as the average search length.
- The direct imposition measures show very high
values many times. This is the case when they
have been able to lift the last observation, say,
into the first half.
71conclusion
- Since
- Cooper's measure and the direct imposition
measure essentially measure the same order,
- Cooper's measure gives relatively low values,
- direct imposition measures give high values
- I conclude that a linear combination of Cooper's
measure and direct imposition measure II seems
the way forward to measure performance.
72to do list
- Answer the question: Why did I ever get into this
rather convoluted topic?
- But now that we have a criterion, we can see if
we can improve by other methods
- bigrams and RePEc keyword values
- different SVM settings
- different algorithms
73http://openlib.org/home/krichel/
- Thank you for your attention!