Title: Week 1: You'll always find what you want
1. Week 1: You'll always find what you want
2. A deep and uncharted web: Space
- The web is huge
- The problem is that time and space have a
different meaning on the web.
- Everything that 'happens' is carved forever
- Try to pull something 'off the web'
- Everything you write and publish will defy
eternity, carved in electrons: the very moment
you put something on the web, someone, somewhere,
will make a copy of it.
- It is bound to reappear, somewhere, sometime:
indestructible and redoubtable powers of the void.
3. A deep and uncharted web: Time
- Time is different too.
- As anyone who has real web experience knows,
something you wrote, or published, remains
unanswered - and apparently uncared for - for
months, or years... and then, all of a sudden,
when you have almost forgotten it yourself, a dozen
people begin contacting you out of the void,
with an enormous and, for you, inexplicable
interest in what you wrote so long ago.
5. Structure
- Four main areas
- Core (the "Nucleus")
- Hidden databases
- Outside linkers
- Outside linked
- Different techniques are used to access these
different areas.
6. Hidden databases
- Hidden databases.
- These are pages that the Nucleus points to.
- They may (or may not) point back to the Nucleus.
- For access-restriction reasons, visitors of sites
located here are supposed to "pay" in order to
access them. As you may imagine, these pages are
NOT mutually linked.
- Fortunately the web was originally built in order
to share knowledge.
- The building blocks, the "basic frames" behind
the structure of the web, are still the same
regardless of the commercial twist.
7. Hidden databases
- If I may dare a comparison: exactly as it is
pretty easy to break any software protection
written in a higher language if you know (and
use) assembly, so it is easy to break any
server-user delivered barrier to a given database
if you know (and can outflank) the protocols used
by browsers and servers.
- As a result, let's simply say that for some it is
relatively easy to access all pages in this area
by reversing the (simple) perl or javascript tricks
used to keep them "off limits".
- (You won't even have to resort to common exploits
à la "politically correct" :-)
8. Outside linked
- The "outside linked".
- The sites in this area are linked from the
Nucleus but do not point back to it.
- For instance, the elements of a database of
images, linked from the Nucleus but not
necessarily pointing back to it.
- These pages are "outside" the Nucleus, yet not
particularly difficult to find.
9. Outside linkers
- Like matter and anti-matter, to the "outside linked"
corresponds an inversely related part of the web:
the "outside linkers" pages.
- Indeed all the pages located in this specific
area of the web do "point" to the Nucleus but are
not pointed to from it.
- Imagine as an example the personal links page of
a scientist: lots of interesting links to the
Nucleus, yet no need to publicize its existence.
- A page with information you may need is there,
somewhere, without any link whatsoever that could
bring you to it. Indeed there are no links back
from the Nucleus to these pages.
10. Outside linkers
- The "outside linkers" are a part of the web you
cannot reach using "normal" search techniques,
since no link whatsoever points to them.
- Yet they may hoard knowledge you need. There are,
fortunately, some techniques that you can apply
in order to find them, the simplest and most common
one being 'klebing'.
- Klebing is using the information found inside the
referrer fields of site logs and statistics
when your target visits your site. The trick is
to lure your potential targets to an interesting
page you create and keep updated until they
land there and unsuspectingly leave an entry in
your web log or stat server.
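The klebing idea above comes down to reading the referrer field of your own server log. A minimal sketch, assuming the log is written in the widespread "combined" access-log format (the slides do not say which format is meant); the sample lines, host names and the function name external_referrers are invented for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: "combined" access-log format, where the referrer is the
# second-to-last quoted field of each line.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def external_referrers(lines, own_host):
    """Count referring pages that do not belong to our own host."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # line is in some other format
        referrer = match.group("referrer")
        if referrer in ("", "-"):
            continue  # the visitor sent no referrer
        if urlparse(referrer).hostname == own_host:
            continue  # internal navigation is not interesting
        counts[referrer] += 1
    return counts

# Invented sample log: one visit arrives from an unlinked private page.
SAMPLE_LOG = [
    '192.0.2.7 - - [10/Feb/2003:13:55:36 +0100] "GET /bait.htm HTTP/1.0" 200 2326 '
    '"http://private.example.org/~scientist/links.htm" "Mozilla/4.0"',
    '192.0.2.9 - - [10/Feb/2003:14:01:02 +0100] "GET /index.htm HTTP/1.0" 200 512 '
    '"-" "Mozilla/4.0"',
    '192.0.2.7 - - [10/Feb/2003:14:02:10 +0100] "GET /more.htm HTTP/1.0" 200 900 '
    '"http://www.example.com/bait.htm" "Mozilla/4.0"',
]

if __name__ == "__main__":
    found = external_referrers(SAMPLE_LOG, "www.example.com")
    for url, hits in found.most_common():
        print(hits, url)
```

Only the first sample line survives the filter: it is the one whose referrer reveals a page on another host.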
11. The Triad
12. Why Teoma
- Because you can refine, refine, refine... and
that's it!
- This is NOT just a simple "search within these
results" thingy
- Best choice for starting a query on BROAD topics.
- Teoma is resistant to index spamming (a huge
problem for Google).
- Teoma eliminated all advertisement banners and
interstitials (popups) in January 2003
- While Google's 'global linking' gives credit to
every link equally, Teoma (should) instead find
'the links that count'... and count them.
- Teoma "creates on the fly clusters of web pages
into topic specific web communities"
13. Why Google
- With Google you can forget hyperlinks: just find
pages by selecting a set of very peculiar words that
uniquely identify a given page.
- Google searches inside .pdf, .doc, etc. files.
Moreover, it locates the text most relevant to
your specific query and highlights your keywords
and their context.
14. Why Fast
- Because it is fast.
- Because it covers parts of the web that are not
covered by Google.
- Because it is less polluted by useless
commercial sites. Because it mines the "deep" and
"obscure" web more than Google or Teoma.
- Because it is less censored than Google. Because
it is simply the best main engine for multi-word
complex searches.
15. Searching
- 1) LOWERCASE: Always enter your search terms in
lower case (unless you want to limit your
search). Most search engines will then find both
upper and lower case occurrences of your
search string. "pAris" is NOT the same as "paris".
16. Searching
- 2) EXACT SEQUENCE "": Enclose terms in double
quotation marks if you want to retrieve those
exact terms in that exact sequence. This may be
very useful in order to find a specific page.
Thus "saerch engine" will retrieve some (11)
pages WITH THIS SAME MISSPELLING.
17. Searching
- 3) NARROW DOWN (AND, +) and ELIMINATE
MERCILESSLY (AND NOT, -): Narrow your
searches by linking your search terms with AND,
or simply use the plus sign (+). The search
engine will find only those pages that contain
all of your search terms. Similarly, exclude
pages that are not relevant to your search by
preceding the search term with AND NOT, or
simply use the minus sign (-). "search engines"
hints tips techniques -tits -sex -"make money"
is better than the simpler "search engines"
hints tips techniques.
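To make the include/exclude logic above concrete, here is a small sketch of how such a query could be evaluated against a page of text. It is an illustration only, not how any real engine works: the helper names parse_query and matches are invented, matching is naive substring matching, and the exclusion terms in the sample query are swapped for neutral ones.

```python
import re

# An optional sign, then either a quoted phrase or a bare word.
TOKEN_RE = re.compile(r'([+-]?)(?:"([^"]+)"|(\S+))')

def parse_query(query):
    """Split a query into (required, excluded) lists of lowercase terms."""
    required, excluded = [], []
    for sign, phrase, word in TOKEN_RE.findall(query.lower()):
        term = phrase or word
        if sign == "-":
            excluded.append(term)   # a leading minus eliminates the term
        else:
            required.append(term)   # a plus sign or no sign means AND
    return required, excluded

def matches(text, query):
    """True if text holds every required term and no excluded term."""
    required, excluded = parse_query(query)
    text = text.lower()
    return (all(term in text for term in required)
            and not any(term in text for term in excluded))

QUERY = '"search engines" hints tips techniques -spam -"make money"'

if __name__ == "__main__":
    print(parse_query(QUERY))
    print(matches("Search engines: hints, tips and techniques", QUERY))
    print(matches("Search engines hints tips techniques to make money", QUERY))
```

Because the test is a plain substring check, a short excluded term would also knock out pages that merely contain it inside a longer word; a real engine works on whole indexed words.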
18. Searching
- 4) DOWNSIDE OF THE BOOLEAN operators: It's often
difficult to specify exactly what you want to
include or exclude. You can also get unexpected
results if you are not careful about your use of
operators and parentheses. For example, the
search seeking OR searching AND finding is the
same as the search seeking OR (searching AND
finding). Both queries will find documents that
contain both searching and finding, together with
documents that contain the word seeking. However,
the query (seeking OR searching) AND finding is
not the same. It will find documents containing
the word finding and, in the same document,
either seeking or searching. Be careful with the
boolean operators!
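The precedence trap described above can be replayed in a few lines. Python's own "and" binds tighter than "or", the same rule the slide describes for search engines, so the sketch below simply evaluates both readings of the query over four invented one-line documents (the DOCS table and the helper names are made up for the illustration).

```python
# Four invented documents, keyed by a short name.
DOCS = {
    "a": "seeking alone",
    "b": "searching and finding together",
    "c": "seeking and finding together",
    "d": "searching alone",
}

def has(text, word):
    """True if word occurs as a whole word in text."""
    return word in text.lower().split()

def unparenthesized(text):
    # seeking OR searching AND finding: AND is evaluated first
    return has(text, "seeking") or has(text, "searching") and has(text, "finding")

def or_first(text):
    # (seeking OR searching) AND finding
    return (has(text, "seeking") or has(text, "searching")) and has(text, "finding")

def hits(predicate):
    """Names of the documents a query predicate retrieves."""
    return sorted(name for name, text in DOCS.items() if predicate(text))

if __name__ == "__main__":
    print(hits(unparenthesized))  # ['a', 'b', 'c']
    print(hits(or_first))         # ['b', 'c']
```

Document "a" contains only "seeking", so it is retrieved by the unparenthesized query but dropped as soon as the parentheses force "finding" to be present.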
19. Searching
- 5) "PECULIAR" strings: You should always strive to
use differentiating keywords when searching the
web. Words that are commonly used will not help
you much. Extremely common words like articles
and prepositions are so worthless that they are
completely ignored. Try to use words which
underline the peculiarity of your target. Common
words, when combined with boolean qualifiers, can
be very effective. You must identify the main
concepts in your topic and determine any
synonyms, alternate spellings, or variant word
forms for the concepts. Remember that the more
"peculiar" a word, the more useful it will be in
sharpening your search. Example: title:"search
strateg*" hints tips. In this case we did
include the "search strateg*" string (which
already has an elevated PEC) in the title
keyword.
20. Searching
- 6) ASTERISK: Note also the use of the asterisk (*)
in the previous example: it MUST be used
after at least 3 characters, and it is valid for up
to 5 characters or as an element of a phrase. For
Altavista:
- Asterisk (*): after 3 specified characters, will
search for matches in up to 5 trailing letters.
- Question mark (?): after 3 specified characters,
will match exactly one more character.
- Double asterisk (**): more flexible, as it will
search for matches for an unlimited number of
trailing characters.
- You can also use the wildcards
interchangeably and more than once in the same
search string.
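The three wildcard rules above can be modelled with ordinary regular expressions. This is only a sketch of the rules as stated on the slide, not the engine's actual implementation; it assumes that "up to 5" includes zero trailing characters and that a wildcard stands for word characters only. The function names are invented.

```python
import re

def wildcard_to_regex(pattern):
    """Translate *, ? and ** (as described above) into a compiled regex."""
    stem = re.match(r"[^*?]*", pattern).group(0)
    if len(stem) < 3 and len(stem) != len(pattern):
        raise ValueError("wildcards need at least 3 leading characters")
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(r"\w*")        # unlimited trailing characters
            i += 2
        elif pattern[i] == "*":
            parts.append(r"\w{0,5}")    # up to 5 trailing characters
            i += 1
        elif pattern[i] == "?":
            parts.append(r"\w")         # exactly one character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)

def wildcard_match(pattern, word):
    """True if the whole word fits the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

if __name__ == "__main__":
    for word in ("strategy", "strategies", "strategically"):
        print(word, wildcard_match("strateg*", word), wildcard_match("strateg**", word))
```

With the single asterisk, "strategically" is rejected because six letters follow the stem; the double asterisk accepts it.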
21. Searching
- 7) STOP WORDS: Stop words are words such as "and",
"the" and "or" which search engines exclude from
their searches to make them more effective. These
terms are excluded because they are either
extremely common or they are used by the search
engine for performing more specialized searches.
Just think about how many documents on the Web
contain the word "the" and you'll understand how
important a good stop words list is for all
search engines.
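As a rough illustration of what an engine does with stop words before it even looks at its index, the sketch below filters them out of a query. The stop list is a tiny invented sample; real engines use their own, much longer lists.

```python
# Assumption: a deliberately tiny stop list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "or", "the", "to"}

def strip_stop_words(query):
    """Return the query terms that survive stop-word removal, in order."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

if __name__ == "__main__":
    print(strip_stop_words("The art of searching and finding"))
```

Only "art", "searching" and "finding" reach the index lookup, which is why typing the dropped words buys you nothing.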
22. Errors you encounter
- 400 - Bad request. What does it mean? There's
something wrong with the URL you typed. Maybe the
server you're contacting doesn't recognize the
document you're asking for, maybe it doesn't
exist, or maybe you're not authorized to access
it. What can you do about it? Check the URL. Pay
special attention to uppercase and lowercase
letters, colons, and slashes. Here's a tip: one
style convention many sites observe is to slap
initial capital letters on directory names but
not filenames. If you get this message
repeatedly, maybe the note you copied the URL
from mixed up its uppercase and lowercase.
- 401 - Unauthorized. What does it mean? You're
probably accessing a site that's protected and
you're not on the host's preferred guest list or
you typed the password incorrectly. Some sites
also put a block on domain types: if you're not
from a .gov or .edu domain, for example, you may
not be able to gain access. What can you do
about it? If you're sure you're allowed in, try
again, and this time look at the keyboard when
you type. Passwords are often case-sensitive, so
if you've got your Caps Lock on, take it off. If
you're trying to break in, we don't want to know,
but the odds are stacked against you.
- 403 - Forbidden. What does it mean? You may not be
allowed to access this document, probably because
it's either blocked to your domain or it's
password-protected. What can you do about it? If
you know the password, try again, carefully. If
you don't know the password but think you're
eligible for one, contact the site's Webmaster
and ask for it.
- 404 - Not found. What does it mean? The server
that hosts the site can't find the HTML document
at the end of the URL. It may be a simple case of
a mistyped URL, but it may also mean that the
document doesn't exist anymore. What can you do
about it? Try going one level up (deleting the
last part of the URL to the nearest slash) to see
if the site is live. If it is, check if there are
links to the document you're looking for. Failing
that, delete the last slash and type .html (or
.shtml) instead, and see what that gives you.
- 503 - Service unavailable. What does it
mean? There are a variety of possibilities: your
access provider's server may be down, your
company's gateway (the connection between the LAN
and the Internet) may be broken, or your own
system isn't working. What can you do about
it? This is usually an easy one: wait a minute
and try again. If the error persists, identify
the culprit (access provider, gateway, or your
system) by process of elimination.
- Bad file request. What does it mean? Your browser
supports forms complete with data-entry fields
and drop-down lists, but not the form you're
trying to access. Perhaps there's an error or
unsupported feature in the form. What can you do
about it? Send email to the Webmaster and try the
form again some other day.
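The 404 advice above (strip the URL back one slash at a time and see whether the site is still alive) is easy to mechanise. A minimal sketch: parent_urls only computes the candidate addresses and does not contact any server, and reason looks up the standard phrase for a status code in Python's http module. Both function names and the sample URL are invented for illustration.

```python
from http import HTTPStatus
from urllib.parse import urlsplit, urlunsplit

def reason(code):
    """Standard reason phrase for a numeric HTTP status code."""
    return HTTPStatus(code).phrase

def parent_urls(url):
    """List the enclosing directories of a URL, nearest one first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    while segments:
        segments.pop()                      # go one level up
        path = "/" + "/".join(segments)
        if segments:
            path += "/"                     # directories end with a slash
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents

if __name__ == "__main__":
    print(404, reason(404))
    for candidate in parent_urls("http://www.example.com/Docs/Papers/missing.html"):
        print(candidate)
```

Trying the candidates in order, by hand or with a script of your own, tells you at which level the site stops answering.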
23. Errors you encounter
- Cannot add form submission result to bookmark
list. What does it mean? You've just entered a
search request and tried to save the result as a
bookmark. Though it may appear as a discrete
address, the result isn't a legitimate URL, so
you can't add it to your bookmark list. What can
you do about it? Try saving the result page as an
HTML page on your hard disk. Use the Save As
command, then add the saved page to your bookmark
list. Depending on the CGI script behind the
query, you may or may not be successful. But it's
worth a try.
- Connection refused by host. What does it mean? You
may not be allowed to access this document,
probably because it's either blocked to your
domain or it's password-protected. What can you
do about it? If you know the password, try again,
carefully. If you don't know the password but
think you're eligible for one, contact the site's
Webmaster and ask for it.
- Failed DNS lookup. What does it mean? The domain
name system can't translate the URL to a valid
Internet address. This is either a harmless blip
or the result of a mistyped URL (specifically, a
mistyped host name). What can you do about
it? Blips in DNS lookup are common, and often you
can rectify this by clicking the Reload button.
If that doesn't work, check your typing of the
URL carefully. If the problem persists, try again
after an hour or so.
- File contains no data. What does it mean? The site
you've accessed is the right one, but there are
no Web page documents on it. You may have
stumbled upon this site just as updated versions
are being uploaded. What can you do about
it? Try the URL again, carefully. If that doesn't
help, try again in an hour.
- Helper application not found. What does it
mean? Your browser doesn't recognize a file at
the Web or Net site you're visiting. Most
browsers can be extended using helper
applications (or viewers) to read files they
don't otherwise recognize. These files aren't
necessarily graphics: they can be sound files,
movie clips, or ZIP or SIT archive files you're
trying to download. What can you do about
it? The dialog box that carries this message will
usually give a clue about the file type that's
missing. (You may see some gibberish about octet
streams, but after that you'll probably see some
reference to graphic-TIFF, which gives it away.)
Look at CNET's Survival Kits for your computing
platform (Mac, PC, or Unix) for viewers for the
most common file types. Then follow your
browser's instructions for assigning a viewer for
each file format you wish to view online.
- Host unavailable. What does it mean? The machine
that hosts this site is probably down for
maintenance. What can you do about it? If at
first you don't succeed, hit Refresh or Reload
again and again. But wait a while between
refreshes.
24. Errors you encounter
- Host unknown. What does it mean? The server may
be down for maintenance, or you may have lost the
connection (your modem disconnected, or your
company's T1 line is choking). What can you do
about it? Hit the Reload button first. This is
often a blip in the Net. Then check the URL for
typos (and don't forget case-sensitivity). Then
make sure you're connected by hitting Reload,
which will re-establish connections in many
cases.
- Network connection was refused by the server. What
does it mean? The server is probably too busy to
handle one more user, but it's not configured to
generate its own message, so this generic message
shows up instead. What can you do about it? As
always, try and try again. If that doesn't work,
wait as long as you can. Then try again.
- NNTP server error. What does it mean? You're
trying to log on to a Usenet newsgroup, but you
can't get to it. The Usenet server is something
that's made available by your Internet service
provider, so it may be that this newsgroup isn't
available at all. What can you do about it? Make
sure you've typed the URL correctly. If that
doesn't help, try again later. If the problem
persists, contact your access provider and give
them a piece of your mind.
- Permission denied. What does it mean? You're
trying to upload a file to an ftp site, and the
site's administrator doesn't want you to.
Alternatively, you're using the wrong syntax when
trying to get a file. Or maybe the site is
currently too busy to handle your upload. What
can you do about it? First check that you used
the correct syntax. Then try again later. If the
problem persists, send email to the Webmaster and
ask how you can upload a file to that site.
- Too many connections, try again later. What does
it mean? This is another variation on the
rush-hour error message. You've picked the wrong
time to call, that's all. What can you do about
it? Do as it says: try again later, or keep
hitting the Refresh button until you succeed.
- Too many users. What does it mean? No ftp site
has unlimited access: physical connections or
administrator policy allocate a limited number of
anonymous users to a given site. When that number
is exceeded, all who try to log on receive this
message. What can you do about it? Just keep
trying until you get lucky. However, on a busy
site (like Netscape's the week after a big
announcement) or one with very limited access
rights, you may be out of luck. If so, check to
see whether the site has mirrors, and try one of
those.
- Unable to locate host. What does it mean? The
server may be down for maintenance, or you may
have lost the connection (your modem disconnected
or your company's T1 line is choking). What can
you do about it? Hit the Reload button first.
This is often a blip in the Net. Check the URL
for typos (and don't forget case-sensitivity),
then make sure you're connected by hitting
Reload, which will re-establish connections in
many cases.
25. Errors you encounter
- Unable to locate the server. What does it
mean? You have either mistyped the URL, or the
server doesn't exist (you may have outdated
information). What can you do about it? Your
mission, should you choose to accept it: enter
the URL again, looking at the keyboard as you
type. No luck? Check with your source to verify
that the URL is correct.
- Viewer not found. What does it mean? Your browser
doesn't recognize a file at the Web or Net site
you're visiting. Viewable files aren't
necessarily graphics: they can be sound files,
movie clips, ZIP or SIT archive files, and so on.
If it's not a GIF or JPEG file, your browser may
not know what it is. What can you do about
it? The dialog box that carries this message will
usually give a clue about the file type that's
missing. (You may see some gibberish about octet
streams, but after that you'll probably see some
reference to graphic-TIFF, which gives it away.)
Look at CNET's Survival Kits for your computing
platform (Mac, PC, or Unix) for viewers for the
most common file types. Then follow your
browser's instructions for assigning a viewer for
each file format you wish to view online.
- You can't log on as an anonymous user. What does
it mean? This message covers a multitude of sins.
Some ftp sites allow people who aren't members,
some don't. Others may allow nonmembers, but
limit the number of visitors. Another possibility
is that your browser doesn't support anonymous
ftp access. The way most browsers handle this is
to submit "anonymous" as the user ID and your
email address as the password. The America Online
browser is one of the few that don't do this.
What can you do about it? Either try again later,
after the rush hour, or enter your user ID and
password manually (using ftp software such as
WS-FTP). Remember: your ID is anonymous and your
password is your (bogus, I hope for you) email
address.
26. Project 1, Part II
- Building the Zen of awareness
27. Cat burglars in the museum after dark
- Our Target
- http://gallica.bnf.fr/
"70,000 digitized documents, more intuitive
navigation: this new version of Gallica
is the most significant update
since this server was created in October 1997."
28. http://gallica.bnf.fr/
29. Find Me In the Museum
30. Become One With Your Environment
- Relax
- If you focus you are trying too hard and you will
miss what is happening.
- Try for 15 minutes, then take a break and do
something else while your mind explores the
problem laterally
- What kind of things is this browser loading?
31. An Outside Linker? Can you find it?
32. http://gallica.bnf.fr/Fonds_Mosaiques/
33. Hints to Find Me
- 07720489
- Scripts can be used to acquire the target
34. What to Submit
- Either the photo (.jpg)
- Or a short (900 word) essay on
- How you think the museum is organized (diagrams
would be helpful)
- Why you were unable to find the picture
- Your thoughts on the outside linker: where it
is, why you can't find it, what it is used for
- Other interesting observations you may have
collected
35. Research
Research ?
36. The Search
p or photograph
37. The Search
Try a photo..
38. Become One With the Screen
Catalog number
What's this?
39. This is where our picture is hiding
mediator.exe
Catalogue number
40. Let's see what's coming over the wire
41. What is mediator.exe?
42. Lead Technologies Inc.
LEAD Technologies Inc. v1.01?
Mediator.jpeg
Beginning of picture bits
Normal JPEG header information
Beginning of picture bits
Normal JPEG file format
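The comparison on this slide (a vendor header, then the "beginning of picture bits") suggests looking for the standard JPEG markers inside the response. A hedged sketch: every JPEG stream starts with the start-of-image marker FF D8 and ends with the end-of-image marker FF D9, but whether the bytes served by mediator.exe are an unmodified JPEG behind the vendor header is an assumption the slides do not confirm. The sample blob and the function name extract_jpeg are invented.

```python
JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus the 0xFF of the next marker
JPEG_EOI = b"\xff\xd9"      # end-of-image marker

def extract_jpeg(blob):
    """Return the bytes from the first SOI to the last EOI, or None."""
    start = blob.find(JPEG_SOI)
    if start == -1:
        return None                       # no JPEG header in this blob
    end = blob.rfind(JPEG_EOI)
    if end == -1 or end < start:
        return None                       # header found but no end marker
    return blob[start:end + len(JPEG_EOI)]

# Invented stand-in for a captured response: header text, then fake picture bits.
SAMPLE_BLOB = b"VENDOR HEADER v1.01\x00" + b"\xff\xd8\xff\xe0" + b"fake-bits" + b"\xff\xd9" + b"tail"

if __name__ == "__main__":
    picture = extract_jpeg(SAMPLE_BLOB)
    print(len(SAMPLE_BLOB), "bytes captured,", len(picture), "bytes of picture")
```

Writing the returned bytes to a file with a .jpg extension is then enough to test whether an ordinary viewer accepts them.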
43. LeadTools
Save the image
44. mediator.exe
The LEADTOOLS Image Server is an ISAPI extension
for IIS 4 and above and is perfect for web
administrators that have images located at a
central location that need to be converted and
processed based on each client's specific needs.
45. HEX Editor
- The tool of choice for looking at binary files
- The tool of choice for modifying unprotected
binary files
- We can modify the executable so it does what we
want it to do without recompiling.
- Common Features
- Search
- Diff
- Modify
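The three common features listed above (search, diff, modify) are simple operations on a byte string. The following sketch shows them on an invented in-memory buffer rather than on a real executable; all function names are made up for the illustration, and diff assumes both buffers have the same length.

```python
def hexdump(data, width=16):
    """Classic offset / hex / ASCII view of a byte string."""
    pad = width * 3 - 1
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk).ljust(pad)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part}  {text}")
    return "\n".join(lines)

def find_all(data, needle):
    """Search: every offset at which needle occurs."""
    offsets, start = [], data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets

def diff(old, new):
    """Diff: (offset, old byte, new byte) for each position that differs."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]

def patch(data, offset, new_bytes):
    """Modify: overwrite bytes in place, keeping the total length unchanged."""
    out = bytearray(data)
    out[offset:offset + len(new_bytes)] = new_bytes
    return bytes(out)

if __name__ == "__main__":
    original = b"Hello, trial version!"
    where = find_all(original, b"trial")[0]
    patched = patch(original, where, b"final")
    print(hexdump(patched))
    print(diff(original, patched))
```

Overwriting with a replacement of the same length matters: inserting or removing bytes would shift every later offset in the file.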
46. PE Format
Windows PE file format
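As a small illustration of what a hex editor shows at the top of a Windows executable, the sketch below checks the two signatures of the PE format: the file begins with "MZ", the 4-byte little-endian value at offset 0x3C points to the "PE" signature, and the COFF header that follows starts with the machine type and the number of sections. The function name and the hand-built test stub are invented; no real executable is read.

```python
import struct

def parse_pe_signature(data):
    """Return basic PE header facts, or None if data is not a PE image."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None                                  # no DOS header
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 8 > len(data):
        return None                                  # pointer runs past the data
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return None                                  # no PE signature
    machine, sections = struct.unpack_from("<HH", data, pe_offset + 4)
    return {"pe_offset": pe_offset, "machine": machine, "sections": sections}

def make_stub():
    """Build a minimal invented byte string that carries only the signatures."""
    stub = bytearray(0x80)
    stub[:2] = b"MZ"
    struct.pack_into("<I", stub, 0x3C, 0x40)         # pointer to the PE header
    stub[0x40:0x44] = b"PE\x00\x00"
    struct.pack_into("<HH", stub, 0x44, 0x014C, 3)   # i386 machine type, 3 sections
    return bytes(stub)

if __name__ == "__main__":
    print(parse_pe_signature(make_stub()))
```

The stub is not a runnable program; it only carries the fields the parser looks at.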
47. SWF Format
SWF format (flash files)
signature
version
file size
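The three fields named on this slide sit in the first eight bytes of an SWF file: a 3-byte signature ("FWS" for uncompressed files, "CWS" or "ZWS" for compressed ones), a 1-byte version, and a 4-byte little-endian file length. A minimal sketch of reading them; the function name and the sample header bytes are invented.

```python
import struct

def parse_swf_header(data):
    """Read signature, version and declared file length from SWF bytes."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    signature = data[:3]
    if signature not in (b"FWS", b"CWS", b"ZWS"):
        raise ValueError("not an SWF signature")
    version = data[3]                                # single unsigned byte
    (file_length,) = struct.unpack_from("<I", data, 4)
    return {
        "signature": signature.decode("ascii"),
        "compressed": signature != b"FWS",
        "version": version,
        "file_length": file_length,
    }

if __name__ == "__main__":
    sample = b"FWS" + bytes([5]) + struct.pack("<I", 1234)
    print(parse_swf_header(sample))
```

For compressed files the length field gives the size after decompression, so it will not match the size on disk.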
48. PDF Format
PDF format
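A PDF file announces itself in plain text: it starts with a header of the form %PDF-1.4, which is what you see in the first bytes in a hex editor. The sketch below pulls the version out of that header; looking only inside the first 1024 bytes is an assumption made here to tolerate leading junk, and the function name and sample bytes are invented.

```python
import re

PDF_HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")

def pdf_version(data):
    """Return the version string from the PDF header, or None if absent."""
    match = PDF_HEADER_RE.search(data[:1024])
    if match is None:
        return None
    major, minor = match.group(1), match.group(2)
    return f"{major.decode('ascii')}.{minor.decode('ascii')}"

if __name__ == "__main__":
    print(pdf_version(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"))
    print(pdf_version(b"GIF89a"))
```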
49. Demo: cust.exe
What are the options?
- Buy
- Reset the system clock
- Change the source code
- Modify the binary
The shareware has expired. We can only order or
exit the program.
50. cust.exe
By changing the binary image we can:
- Add new functionality
- Eliminate existing functionality
- Force changes in code flow
We have relied on the compiler to encrypt the
source. Is our program really secure?
51. cust.exe
By using only a HEX editor we have added new
functionality that allows us to continue