Introduction
In recent years, the Internet has seen a very significant increase in its number of users, in the number of connected computers, and in the amount of information available. By its very nature, this network has been, from the outset, a potential tool for information and documentation professionals.
The advantages of the Internet are indeed numerous:
• it is relatively cheap,
• it provides a simple, unified navigation interface,
• it is a rapid and effective means of communication,
• it is a "dynamic" place of publication, in that anyone can publish documents and update them easily and quickly.
The network is trying to establish itself as a source of information complementary to conventional services. But sitting on top of a mountain of books does not suddenly make anyone more intelligent: one must be able to find relevant information on the network and exploit that knowledge, and the very nature of the Internet means that this information is scattered rather than centralized.
To find this information, many search tools have been developed. We will first try to present an overview (non-exhaustive) of these tools and the specific characteristics (pros, cons, etc.) of each of them.
We will then focus in more detail on search engines, giving a detailed description of them, in particular of their various modes of operation.
Finally, we will try to show that the optimal operation of such tools is only truly possible when their architecture has been specifically designed for it.
Definition
The term " search engine " is often wrongly used to describe any research tools.
Indeed, from the time when there is a possibility of research, we tend to use the generic term "engine" , yet it is vital to differentiate each of these tools , their characteristics making difficult to compare.
Presentation of the different tools
Search engines
Definition
A search engine is a tool that allows the user to search all "web" pages for a word or expression.
It generally relies on software that periodically scans part of the static Web in order to update a database indexing the words of all or part of the files (pages) visited.
For example, AltaVista stores some 350 million files, Lycos 340 million, WebCrawler only 2 million, and so on.
Operation
When the user enters a keyword in the search form, the engine looks for it in its database, that is to say in the content of the Web pages it has saved.
The query syntax is very important, so the user must learn to use the keywords and the query language specific to each search engine.
Once identified, the "batch" of pages containing the requested term is ranked in order of relevance (position and frequency of the search terms), using a sorting algorithm based on criteria specific to each engine.
Examples:
• AltaVista ( http://www.altavista.com )
• Excite ( http://www.excite.com )
• Google ( http://www.google.com )
• Lycos ( http://www.lycos.com )
• WebCrawler ( http://webcrawler.com )
The various categories
General and semantic search engines
The first engines were characterized by a full-text search function focused exclusively on unstructured data, that is to say the data found in the body of documents.
Subsequently, technological developments extended this to the reading of structured data ("metadata", i.e. META tags in HTML).
Many other file formats (such as Word documents) follow the same kind of process, with a different processing mode.
To further refine searches, so-called semantic engines then appeared; their main objective is to integrate the meaning of language into the search process. To do so, they rely on dictionaries of concepts (or thesauri) specialized in the treatment of specific themes.
This method makes it possible to provide relevant answers in specialized areas to users who are not necessarily specialists themselves.
Main constraint: semantic work must be performed upstream, then refined on the basis of user feedback.
Products:
• Hummingbird (EIP )
• Verity ( Portal One )
• Autonomy ( KnowledgeServer )
• Sinequa ( Intuition )
• LexiQuest ( LexiQuest ) .
Multi-dimensional search engines
Basically, their operation is based on that of OLAP (OnLine Analytical Processing) cubes, used in particular in data warehouses within decision-support systems.
This method requires case-by-case parameterization; it refines the categorization of documents and handles cross queries.
These are probably the most advanced products.
Note: coupled with semantic search, this type of engine can become very powerful.
Products:
• Instranet ( Instranet 2000).
The vertical search engines
This is a relatively new category on the market.
It covers solutions dedicated to very specific business issues, such as the Web, human resources management, or particular trades.
Products:
• Arisem ( OpenPortal4U )
• Auracom ( Auraweb )
• Alogic ( Alcalimm )
• Atomz ( Atomz ) .
Advantages
• Relatively comprehensive in their field,
• Index every word of a Web page,
• Offer search options that are sometimes useful.
Disadvantages
• The number of responses is very often (too) high,
• Many documents lack relevance to the query,
• They require some experience to use.
The search directories
Definition
A directory (also called a catalogue) is a search tool that lists a number of sites through descriptive records generally comprising: a title, an address (URL) and a brief description of 15 to 25 words maximum (written by the directory's editorial staff or provided by the site's editor).
This catalogue relies on a database (usually populated manually) describing a selection of sites, indexed using a tree-structured list of topics (sometimes called categories or themes).
Operation
The user may search using words contained in the site name, its classification and, optionally, its description, and/or by navigating through the topic tree.
When a keyword is entered, the directory searches for occurrences of the term in its records, not in the content of the pages themselves (the main difference from search engines).
The output is a list of sites classified thematically according to the word or words found in the subject headings or elsewhere (name, site description, etc.).
Advantages:
• Operation is relatively simple,
• Most of the important sites are referenced,
• The user is guided in his research: he moves step by step towards a more precise classification.
Disadvantages:
• Non-exhaustive: only a small part of the network is referenced.
• The rapid evolution of the Internet requires the content of the records to be updated regularly (which calls for an editorial team)... which is not always the case.
Examples:
• AOL Search ( http://www.recherche.aol.fr )
• C'trouvé ( http://www.ctrouve.com )
• Francité ( http://francite.com )
• The yellow pages ( http://www.pagesjaunes.fr )
• MicroSoft Networks ( http://www.msn.com )
• Netscape Netcenter ( http://www.netscape.fr )
• Nomade ( http://www.nomade.fr )
• UREC ( http://www.urec.fr )
• Yahoo France ( http://www.yahoo.fr ) .
The meta-search engines
Definition
To provide more possibilities for users, meta-engines appeared.
These "super engines" make it possible, from a single query, to search several search engines and directories simultaneously and to view the results (from the different search tools used) on a single page.
The user does not have to worry about the position of the search words (title, description, etc.).
Advantages
• Increasing efficiency,
• Cumulative power of several engines.
Disadvantages
• Length of the search,
• Erratic results,
• Problems adapting the query to each tool (related to the heterogeneity of the syntaxes used by the different engines),
• Lower efficiency than specialized search engines on complex queries.
Note: the first two disadvantages are not really significant.
Examples:
• Debriefing ( http://www.debriefing.com )
• MetaCrawler ( http://www.metacrawler.com )
• Savvy Search ( http://www.savvy.com )
• Search.com ( http://www.search.com )
• The Big Hub ( http://www.thebighub.com ) .
Portals
Definition
A portal is a public "website" that in some way plays the role of a gateway to the various Internet services.
It offers a set of resources and services, either general or specific to a domain (search tool, directory, free e-mail service, etc.), to a defined set of users (the general public, members of a profession or industry, etc.). The idea is to offer the best possible services with the aim of making itself indispensable!
These services typically include one or more Internet search tools (engines and/or directories), owned by the portal or not, but also regularly updated information and, where appropriate, other types of services, commercial or not.
Examples:
• AltaVista ( http://www.av.com )
• Yahoo! ( http://www.yahoo.com )
Services
• Search tools,
• Information services (news, finance, weather, etc.),
• Communication tools (e-mail, mailing lists, newsgroups),
• Thematic access to information,
• Consumer tools (online sales, advertising, etc.),
• Customizable services,
• Content.
Advantages
• Easier navigation and information search on the "Web",
• A tree structure that is intuitive for the user,
• Value-added services.
Disadvantages
• Such sites can lead to a partitioning of users, who tend to limit themselves to the portal's content...
New Strategies
The major public portals attract heavy traffic and give advertising companies direct access to millions of consumers.
The cost of acquiring and retaining a customer is thus reduced.
These new sites therefore represent enormous stakes.
Ways of financing a portal include, for example:
• Banner advertising,
• Partnerships,
• Online sales.
Specialized ("niche") or professional portals
These may be thematic portals bringing together virtual communities around specific interests, or regional (or local) portals facilitating the identification of resources specific to a region.
Identifying such services is usually a great help in detecting sites relevant to the research topic.
Examples:
• Dentist ( http://www.visioweb.com )
• Comics ( http://www.abcbd.com )
• Education and teaching ( http://www.education.fr )
• Culinary Art ( http://www.cuisinons.com )
Rings (webrings, chains, virtual communities, etc.)
Definition
Many sites on the Internet deal with closely related subjects. The ways they address these issues are often more complementary than competitive, and webmasters frequently include links to sites similar to their own. This forms a community, decentralized to the extreme, yet bound by a common theme.
A ring is an association of Internet sites whose aim is to promote and publicize the members of the community. It consists of a hub site that lists and describes all the member sites, and of a small logo that appears on the pages of the member sites.
The ring is clearly distinguishable from a simple site link because membership is voluntary. It also differs from a search engine, whose users often drown in the quantity of hits: the ring selects and controls the quality of the sites it proposes.
Rings or engines?
When visitors use a search engine to obtain information, they must sort through the responses returned, consult the sites that seem more or less likely to answer their query, and eliminate those that contain hardly anything or that no longer exist... in short, it is usually not without difficulty that one obtains a comprehensive list on the topic searched.
Rings are one response to this concern. Being managed manually, they involve a selection of candidate sites and guarantee the quality and reliability of the information. Of course, one must first find the ring corresponding to one's research.
There are two ways of doing this:
• either via the ring logo on a site whose address was obtained precisely or at random,
• or via the "WebRing" site, which specializes in this type of information.
With or without " WebRing " ?
When sites want to form a group, they exchange their addresses. But this mechanism poses several problems, including updating all the sites whenever a new member joins, and the lack of official recognition with respect to the rest of the Net.
"WebRing" provides a referenced structure along with the necessary tools.
There are tens of thousands of rings referenced in "WebRing" (some of which include several hundred sites) but still very few French rings (although this is changing).
Thus, as of 10/02/98, there were 45,830 active rings (at least one access per day), comprising from several hundred to several thousand sites.
Webmaster and ringmaster
Initially, a webmaster decides to group together sites dealing with the same topic: tourism, personal pages, sport, literature, a geographical or cultural entity, etc.
He therefore creates a ring at "WebRing", gives it a name (Ring ID) and registers his own website, which is identified within the ring by a number (Site ID).
In parallel, he creates a logo and a navigation table that allow visitors to move between the member sites.
For the ring to become operational and lively, other sites must join it and be validated by the ringmaster, the coordinator of the group. It is up to the ringmaster to validate the waiting sites, checking that their content follows the spirit of the project.
Examples
• On the Lyon region ( http://nav.webring.yahoo.com/hub?ring=autour&list )
• On snowboarding in England ( http://s.webring.com/hub?ring=ukboarders )
• On the singer Sting ( http://nav.webring.yahoo.com/hub?ring=fieldsofsting&list )
The " invisible web "
Definition
The " invisible web " (or " hidden Web ") is not strictly a research tool, it is, in fact, part of the Web containing all documents (texts, videos , pictures. ..) were not indexed by traditional search engines (engines, directories ... ) .
Indeed, it is relatively easy to find the address of a site using conventional search tools , however it is almost impossible to perform a query on the contents of its catalog ( databases , if it has , accessible only via their own search engine) .
Origins of the " non- indexing"
Why such documents were not indexed are many and varied :
• Indexing concerns only music files ( MP3 , midi ) , images (gif, jpg ... ) , and documents in HTML ( and soon those in XML format). Impossible to find a document ( word processor , spreadsheet ... ) , animation (Flash) , or PDF, PostScript file ....
• Lack of access to data due to their dynamic (dynamic pages , databases )
• The robot in charge of research has been deliberately restrained according to certain criteria (levels of width and depth , followed by external links, page size )
• Some documents were simply "banned SEO " by their authors or by using meta tags or using files (" Robots.txt " )
• The algorithm used by the engine did not consider as relevant information contained in the page , relative to the application,
• The site is too new or has not taken the step to make reference to be present in the database tools .
Research Tools
Currently, the size of the "invisible Web" is estimated at approximately 30-35% of total content... fortunately, a large number of tools exist to let us discover this new world:
• Adobe PDF Search ( http://searchpdf.adobe.com ): more than a million documents are available at this address (all in PDF format!).
• All-in-One Search ( http://www.allonesearch.com/ ): one of the oldest database search tools: effective, but hard to get to grips with.
• AlphaSearch ( http://www.calvin.edu/library/searreso/internet/as/ ): another well-made tool, although not a search tool as such.
• National Library of France ( http://www.bnf.fr/web-bnf/liens/index.htm ): a collection of annotated bookmarks.
• Electronic Journal Access ( http://www.coalliance.org/ejournal ): gathers information resources from several American universities.
• Fossick ( http://www.fossick.com ): nearly 1,400 links to the best specialized search tools.
• InvisibleWeb ( http://www.invisibleweb.com ): a search service covering more than 10,000 archives, databases, catalogues... all organized thematically.
• Searchability ( http://www.searchability.com ): links to many sites about the "invisible web".
• The Invisible Web Catalog ( http://www.invisibleweb.com ): "the" place to start research in this area.
• Webcats ( http://www.lights.com/webcats/ ): offers a selection of library catalogues from around the world.
Other tools
There is such a diversity of search tools today that it would be completely unrealistic to try to list them all.
Some examples from these hard-to-classify categories:
• Address searches organized by geographical area:
FinderSeeker ( http://www.finderseeker.com )
Excite Travel ( http://www.excite.com/travel )
The limitations of these tools
Using just one of the services mentioned above is not sufficient.
Indeed, each tool has its strong points, even though they are sometimes built on a common technology.
No tool is perfect; several should always be used in parallel.
For example, a first step would be to use a directory such as Yahoo or LookSmart and then, depending on the results, to move on to engines or meta-engines.
The technology of search tools
General Operation
The objective of search tools is to identify sites, browse them, and store their content (usually in the form of an index) in order to facilitate access to their pages.
Almost all these tools are based on the same architecture, composed of three separate modules: an exploring robot ("spider"), an indexing system, and search software (the "searcher"):
Each module has a clearly defined role:
• the "spider" surfs like a browser, retrieving and analysing as much information as possible (URL, title, relevant words, etc.) from the pages it visits;
• the indexing system is then responsible for storing and classifying this information appropriately in a database;
• the "searcher", for its part, is responsible for finding in this database the documents most relevant to the query submitted to it.
The robot explorer software (" spider ")
The spider (also called a crawler or bot) is a software robot, a kind of "snoop" that autonomously explores the meanders of the "Web".
Its role is crucial, because part of the power of the engine depends on it.
Its parameterization must be done carefully, because it is not uncommon for competing engines to use the same spider for their exploration... it is the finesse of their respective settings (e.g. selective versus massive exploration of sites) that makes the difference in their effectiveness.
These "explorer" programs use algorithms to periodically revisit millions of pages (taking care not to loop) and thus build up a database of previously visited sites and documents.
Some spiders take in the full page, while others keep only the title, the summary prepared by the author of the page, and the associated parts called meta tags.
These robots then identify the hypertext links in the page (checking that they have not already been visited) in order to reach the pages they point to. They thus rapidly traverse the entire site and the sites linked to it... and so on.
This traversal can be done breadth-first, depth-first, or in a mixed fashion (breadth up to a certain point, then depth).
Depending on how quickly the robots rotate over the network, a search tool thus has a huge database of URLs, allowing it to give a good approximation of the entire Web at any given time.
But beware: although spiders can "swallow" nearly ten million pages per day, only the most visited sites are regularly refreshed.
Note: the Internet code of conduct (netiquette) defines instructions that allow site administrators to specify, through a file called "robots.txt", the areas where robots may do their work and the private areas that should not be catalogued.
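As a rough illustration of such a spider, here is a minimal sketch (assuming the third-party requests and beautifulsoup4 packages, which are not mentioned in the text): it performs a breadth-first traversal of a single site, skips pages already visited, and honours the "robots.txt" exclusions described in the note above.

```python
# Minimal spider sketch: breadth-first crawl of one site, honouring robots.txt.
# Assumptions: the "requests" and "beautifulsoup4" packages are installed.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib import robotparser

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=100):
    """Explore a site breadth-first and collect (url, title, text) triples
    that would then be handed to the indexing system."""
    robots = robotparser.RobotFileParser()
    robots.set_url(urljoin(seed_url, "/robots.txt"))
    robots.read()

    visited = set()
    queue = deque([seed_url])
    collected = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or not robots.can_fetch("*", url):
            continue                      # already seen, or excluded by robots.txt
        visited.add(url)
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        title = (soup.title.string or "") if soup.title else ""
        collected.append((url, title, soup.get_text()))
        # follow hyperlinks that stay on the same host (breadth-first order)
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == urlparse(seed_url).netloc:
                queue.append(target)
    return collected
```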
The indexing system
The spider then returns the collected information to the indexing engine, where it is analysed. The engine builds an index of the words encountered (together with the addresses of the relevant pages) and stores everything in a database; this is generally referred to as automatic indexing.
Note: the complete renewal of the index (called the "refresh time") may take between one week and one month depending on the engine.
Data extraction
Electronic documents are rarely plain, unformatted text; they are office documents, marked-up documents or, more generally, rich text (typography, presentation, layout, etc.).
Because of their particular structure (specific to their creation tool), these files cannot be indexed until they have been converted: the indexing engine only accepts data in the form of continuous text.
Tools for extracting the text before indexing are therefore needed.
Some indexing engines use these techniques to classify documents in their "native" formats, without duplicating them, while others prefer to transcode them into a simplified format and then work only on the "plain" text.
Indexing techniques
At first, only the titles of documents were indexed, but this solution quickly showed its limits: firstly, because the title of a document does not always reflect its content, and secondly, because it generates many redundancies.
The solution then adopted was to store most of the words of the title and of the first paragraph.
Today, this method has been abandoned in favour of meta-data (or meta tags), which allow authors to specify the words with which they want the contents of their pages to be indexed.
But these tags are still too rarely used (indexing and cataloguing has become a full-time job...) or too poorly used (many cheat by overloading this tag...) to be exploited effectively by robots. Nevertheless, this idea remains a solution for the future.
Note: AltaVista, for example, limits this indexing to the first 1024 characters of the source (1 KB, characters being coded on one byte, i.e. eight bits).
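A minimal sketch of this metadata-oriented indexing, using only the Python standard library; the 1 KB cap mirrors the AltaVista note above, and the helper names are purely illustrative.

```python
# Sketch: extract the title and the KEYWORDS / DESCRIPTION meta tags of a page
# and build a small set of index words from them (illustrative helper).
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects the <title> text and the keywords/description meta tags."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name in ("keywords", "description"):
            self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def index_entry(html_source):
    parser = MetaExtractor()
    parser.feed(html_source[:1024])   # only the first 1 KB of the source is read
    words = set(parser.title.lower().split())
    words |= set(parser.meta.get("keywords", "").lower().replace(",", " ").split())
    return words
```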
Examples of indexing software :
• Altavista,
• Excite,
• Fulcrum
• Infoseek ,
• IntelliServ ( Verity )
• PLsearch ,
• Livelink ( Opentext )
• RetrievalWare (Excalibur).
The search module (" searcher ")
Appearance and operation
The searcher (the search interface) is the visible part of the iceberg, the user-facing front end.
This showcase page of the site (the home page) is updated regularly and usually decorated with contextual advertisements (which vary according to the wording of the query).
Through this graphical interface, the visitor can enter a question, select the available options, and click a button (or press the Enter key) to start the search. A CGI (Common Gateway Interface) script then calls the indexing system so that it runs a query against the database containing the data collected on the Web.
The different types of query:
• The Boolean query: uses logical operators (AND, OR, NOT, NEAR, etc.). Each engine or directory offers a different set of them, and the query formulation also varies from one to another; a good knowledge of these tools is therefore necessary to conduct relevant searches (a short sketch after this list illustrates the first two query types).
• Search through a list of words: the user's query is transcribed into a Boolean expression by the system according to a pre-established scheme (implicit AND, implicit OR, etc.).
• Natural language search: this type of search requires "intelligent" indexing and searching, implementing sophisticated language-processing modules. No system on the Internet (yet) has such processing power.
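To make the first two query types concrete, here is a minimal sketch over a toy inverted index; the index contents and function names are illustrative, not taken from any particular engine.

```python
# Sketch: an explicit Boolean query (AND / NOT) and a word-list query rewritten
# with an implicit AND, evaluated over a toy inverted index.
def search_boolean(index, required, excluded=()):
    """index maps a word to the set of document ids that contain it."""
    result = None
    for word in required:                       # AND over required terms
        docs = index.get(word, set())
        result = set(docs) if result is None else result & docs
    if result is None:
        return set()
    for word in excluded:                       # NOT over excluded terms
        result -= index.get(word, set())
    return result

def search_word_list(index, query):
    """'tintin dog' is transcribed as the Boolean query: tintin AND dog."""
    return search_boolean(index, query.lower().split())

index = {"tintin": {1, 2}, "dog": {1, 2, 4}, "horse": {2, 3}}
print(search_word_list(index, "tintin dog"))               # {1, 2}
print(search_boolean(index, ["dog"], excluded=["horse"]))  # {1, 4}
```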
The basic search techniques
Classical documentary search
Every information-search tool operates from index files, the heart of the technology used, which represent the textual information stored in the electronic documents.
Classical documentary search uses keyword-type indexes, in which index entries are standardized words or phrases drawn from a controlled vocabulary (such as a thesaurus, for example).
This is a proven technique that does indeed make it possible to find information that has been indexed. The search is exact: a Boolean search equation is matched against the content of the index files. It is perfectly effective and leaves no room, in computing terms, for any uncertainty as to whether the documents found correspond to the search equation.
However, it requires significant resources to index the documents, and the quality of the search results depends directly on the quality of the indexing.
If the electronic documents are unstructured textual information, the index entries may be much more numerous and little or not at all standardized. To be effective, the search cannot be limited to a simple analysis of the presence or absence of words, at the risk of producing excessive silence and noise rates. It is precisely on these issues that technologies have evolved significantly in recent years.
Text search
Unlike information retrieval in a structured database, text search consists of finding the documents that "resemble" the question as much as possible. It is impossible to obtain all the relevant documents and only those, because language is ambiguous. The aim is therefore to optimize the search results in terms of both noise and silence, that is to say, to analyse the relevance of a document with respect to the question; and this is the difficulty. Especially since it is no longer enough to examine a single database of information but many, scattered over a network, and it also becomes necessary to limit the replies presented to the most relevant only.
The techniques used are varied, the main ones being linguistic techniques, statistical techniques and fuzzy search. They can be used alone or in combination.
Linguistic techniques
They can operate at two levels: indexing and searching.
• At indexing time, the goal is to reduce the number of index entries by putting phrases and words into "canonical" form: a word and its inflected forms, or a verb and its conjugations, are reduced to a single entry, as are, through morphological and syntactic analysis of the text, idiomatic expressions. For this, a dictionary of the language is used, possibly customized for the field covered by the document corpus.
• At search time, the question expressed in natural language is analysed to extract the most significant words, the question is disambiguated via an interactive dialogue with the user, and a normalized query is generated for the search engine.
Electronic dictionaries are not simple lists of words or expressions, but true semantic networks that link expressions through specific relations and include terminological definitions.
They are also used in relevance analysis, to weight the results according to the semantic proximity between the documents and the terms of the question.
The software packages available today either integrate linguistic functions for indexing and searching into their core engine, or rely on third-party tools, mainly for linguistic search only.
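A toy illustration of the indexing-side normalization just described: inflected forms are reduced to a canonical dictionary entry before being added to the index. The tiny dictionary below is a stand-in for the language dictionaries mentioned above.

```python
# Sketch: map inflected forms back to a canonical entry before indexing.
# The dictionary is purely illustrative, not a real linguistic resource.
LEMMAS = {
    "horses": "horse", "dogs": "dog",
    "searched": "search", "searching": "search", "searches": "search",
}

def normalise(tokens):
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

print(normalise("Searching horses".split()))   # ['search', 'horse']
```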
Statistical techniques
Three broad categories of statistical techniques are available today, with different purposes.
• The first consists of ordering a search result in decreasing order of relevance, and therefore of computing the weight of a document with respect to the question. For this, criteria are taken into account such as the frequency of the words in the document, the discriminating power of the question words (depending on the rarity of the terms in the index), the distance between the question words as found in the document, the density of question keywords in the document, and so on. Obviously, if the indexing and search engine uses linguistic techniques, these statistical techniques can be applied not to the words but to the index entries (standardized concepts), yielding an even better relevance ranking (a small sketch after this list illustrates such a weighting).
• A second category of statistical techniques essentially aims to assist users in their quest. For example, when users write a Boolean query, they can indicate whether they want a strict search or would accept an approximate result. The search mechanism then analyses the different intermediate result sets, one for each criterion, and, mainly on the basis of word-frequency criteria, may extend the strict result set. Similarly, similarity search consists of building a query that retrieves texts as similar as possible to a portion of text selected by the user. Here again, it is through a weighting of the search criteria and a statistical analysis that the results are sorted and presented to the user in ordered form.
• The third technique, called automatic classification, aims to group search results into homogeneous classes and to identify, for each class, the words or phrases that best describe it. The user can thus quickly identify the classes of interest and discard the others. This is a statistical technique widely used in other areas (economic or demographic surveys, for example). By refining the question step by step, the user can identify the documents closest to his expectations.
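As an illustration of the first category, the sketch below ranks documents with a TF-IDF-style weighting that combines term frequency in the document with the rarity of the term in the collection. The exact formulas used by commercial engines are not published, so this is only a plausible stand-in.

```python
# Sketch: order documents by decreasing relevance using term frequency and the
# rarity of query terms across the collection (TF-IDF-like weighting).
import math
from collections import Counter

def rank(documents, query_terms):
    """documents: {doc_id: list of tokens}; returns doc ids by decreasing score."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq[term]:
                idf = math.log(n_docs / doc_freq[term])   # rare terms weigh more
                score += tf[term] * idf
        if score:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```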
Fuzzy search
Fuzzy search aims to enable text retrieval while allowing errors, either in the formulation of the query or in the content of the texts. Typically, this technique is widely used for searching texts that result from optical character recognition (OCR), whose pitfalls are well known. Indeed, OCR generates two types of errors: unrecognized characters and misrecognized characters. In both cases, the words are "scrambled" by spurious or approximate characters; for example, the OCR may recognize the term "irninente" instead of "imminente".
The methods used are numerous and are based either on algorithmic techniques or on neural networks.
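One of the simplest algorithmic approaches is the edit distance; the sketch below tolerates OCR-style errors such as the "irninente" / "imminente" example above (neural-network variants are not shown).

```python
# Sketch: Levenshtein edit distance and a fuzzy lookup over index terms.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_lookup(index_terms, word, tolerance=2):
    return [t for t in index_terms if edit_distance(word, t) <= tolerance]

print(fuzzy_lookup(["imminente", "immense", "dog"], "irninente"))  # ['imminente']
```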
One or more techniques?
The different search techniques, or more precisely the techniques for improving relevance, are not competing but complementary, and the technological evolution of the engines on the market, as well as the number of active partnerships, shows that, while the problem is not simple, the solution probably lies in combining the available techniques.
One particularly difficult issue is the management of multilingualism, or at least of databases containing information in several languages. While linguistic techniques (and parsing rules) are obviously language-dependent, the same is not true of the other techniques (statistical or fuzzy search). Indeed, even if the weighting and calculation principles may differ slightly from one language to another, the same criteria are applicable everywhere. This is probably one of the reasons why these techniques currently have the wind in their sails.
Presentation of results
Results are displayed as a dynamically generated web page, generally presenting the responses as a list (more or less detailed...) or as a number of hits (an option, as it is not very convenient), ranked in order of relevance to the query.
If the basic layout of the document is to be respected, the indexing system must record the position of each word in the original electronic documents so that the engine can highlight them when displaying the results.
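A minimal sketch of that idea: if the index keeps the position of each word, the searcher can re-read the stored document and wrap the matched terms for display. The HTML &lt;b&gt; tags are just one possible rendering.

```python
# Sketch: record word positions at indexing time and highlight query terms
# when the result page is generated (illustrative helpers).
def positional_index(tokens):
    positions = {}
    for i, token in enumerate(tokens):
        positions.setdefault(token.lower(), []).append(i)
    return positions

def highlight(tokens, query_terms):
    terms = {t.lower() for t in query_terms}
    return " ".join(f"<b>{t}</b>" if t.lower() in terms else t for t in tokens)

doc = "Tintin and Snowy travel to America".split()
print(positional_index(doc)["tintin"])          # [0]
print(highlight(doc, ["tintin", "america"]))
```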
Architecture of search tools
It is very difficult to find documentation on the internal architecture of search engines, as each vendor wants to preserve its technological secrets.
That is why this part will focus mainly on examples of architectures of existing engines (whose makers were kind enough to disclose some of their characteristics).
General Structure
Search engines use one or more databases in which the data previously retrieved by the exploration tools during their traversal of the "Web" is indexed.
These databases can be stored on one or more disks (or on one or more servers) in order to parallelize as much as possible the processing applied to them.
The search and indexing processes can also be particularly "greedy" in terms of CPU resources, so their processing can in turn be parallelized, for example by using servers grouped in a cluster.
Example: the DILIB search engine
DILIB (available at http://www.loria.fr/projets/DILIB/dilib-0.2/DOC/index.html ) is a platform for document engineering and for information science and technology, containing among other things a search engine whose architecture is similar to those found on the "Web".
Structure of the database
The effectiveness of a search engine depends largely on the speed with which it can retrieve information from its content.
That is why virtually all engines are organized around inverted files.
Such files contain as many records as there are distinct descriptors, and thus make it possible to find the list of primary documents (those containing a given descriptor) from a single record.
From a file of bibliographic records (a simple sequence of records), the generation of an Information Retrieval System (IRS) produces:
• a direct file (the bibliography...) with a direct access mechanism,
• inverted files: one for each indexed field,
• various files: parameters, stopwords, etc.
In the case of "Web" search engines, the records represent all the data collected by the robot during its exploration of the Net.
Hierarchical data organization
DILIB helps build inverted files on HFD (Hierarchical File organization for Documentation) structures:
Principle
Key <-> Unix location correspondence
Each record is assigned a 6-digit key (000000 to 999999), which is then divided into three 2-digit fields: "dd ff rr".
Records are grouped in batches of 100.
Each batch is stored in a Unix file whose name is of the form "ff.df".
Example:
Record 123456 will be the 56th record in a file named 34.df.
The ".df" files are in turn grouped in sets of 100.
Each set is stored in a Unix folder (directory) whose name is of the form "dd.dd".
Example:
Record 123456 will therefore be the 56th record in the file 34.df, stored in the directory 12.dd.
These directories are finally gathered under a root directory whose name ends in ".hfd".
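The key-to-location rule above can be expressed in a few lines; the root directory name below is illustrative, and only the ".dd" / ".df" / ".hfd" naming from the text is assumed.

```python
# Sketch of the HFD key-to-location mapping: key "ddffrr" -> directory dd.dd,
# file ff.df, record rr within that file.
def hfd_location(key, root="base.hfd"):
    """Map a 6-digit record key to (path, position in file)."""
    key = f"{int(key):06d}"
    dd, ff, rr = key[0:2], key[2:4], key[4:6]
    return f"{root}/{dd}.dd/{ff}.df", int(rr)

print(hfd_location(123456))   # ('base.hfd/12.dd/34.df', 56)
```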
Generalization
In fact, HFD structures can manage larger sets by playing on:
• the number of index levels,
• the nature of the keys (numeric, alphanumeric),
• the length of the keys.
Inverted files
Principle
This mechanism improves the performance of the search engine.
Inverted files are created from data in the direct file, e.g. keywords or information about the records (authors, title, etc.).
An inverted file contains a list of values, arranged in order, for a given field (keywords, author, etc.), and thus allows rapid access to the corresponding records in the direct file (one inverted file is created for each field to be classified); a short sketch after the examples below shows this derivation.
Example of direct file:
Record no. | Author(s) | Title | Keywords
000001 | Herge | Tintin in the Congo | Tintin, Snowy, dog
000002 | Herge | Tintin in America | Tintin, Snowy, horse, dog
000003 | Morris, Goscinny | The Daltons | Lucky Luke, horse
000004 | Goscinny, Uderzo | Asterix the Gaul | Asterix, Idefix, dog
Example of inverted file (keywords):
Asterix | 000004
Horse | 000002, 000003
Dog | 000001, 000002, 000004
Idefix | 000004
Lucky Luke | 000003
Snowy | 000001, 000002
Tintin | 000001, 000002
Example of inverted file (authors):
Goscinny | 000003, 000004
Herge | 000001, 000002
Morris | 000003
Uderzo | 000004
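As announced above, the following sketch shows how such inverted files can be derived from the direct file; record identifiers follow the example tables, and the Python structures are illustrative.

```python
# Sketch: derive the keyword and author inverted files from the direct file.
direct_file = {
    "000001": {"authors": ["Herge"], "title": "Tintin in the Congo",
               "keywords": ["Tintin", "Snowy", "dog"]},
    "000002": {"authors": ["Herge"], "title": "Tintin in America",
               "keywords": ["Tintin", "Snowy", "horse", "dog"]},
    "000003": {"authors": ["Morris", "Goscinny"], "title": "The Daltons",
               "keywords": ["Lucky Luke", "horse"]},
    "000004": {"authors": ["Goscinny", "Uderzo"], "title": "Asterix the Gaul",
               "keywords": ["Asterix", "Idefix", "dog"]},
}

def invert(records, field):
    """Build an inverted file for one field: value -> ordered record ids."""
    inverted = {}
    for record_id, record in sorted(records.items()):
        for value in record[field]:
            inverted.setdefault(value, []).append(record_id)
    return dict(sorted(inverted.items()))

print(invert(direct_file, "keywords")["dog"])   # ['000001', '000002', '000004']
print(invert(direct_file, "authors")["Herge"])  # ['000001', '000002']
```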
The index tables
Principle
This mechanism also improves the engine's performance.
Index tables are used to locate a given keyword in an inverted file.
An example: the index table of the authors of the XML Server:
Symbolic key | Numeric key
ABRAMS (M.) | 000000
DEROSE (S.) | 000100
JAAKKOLA (J.) | 000200
MURTAGH (F.) | 000300
Suciu (D.) | 000400
Index tables are used to associate, with a given item key, the relative address of that item; the index may be sorted or not.
The density of the index is the ratio between the number of keys in the index and the number of articles.
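A minimal sketch of such a lookup, using a binary search on the sorted symbolic keys of the XML Server example above; the Python structures and the article count used for the density are illustrative.

```python
# Sketch: locate an author in the inverted file via a sorted index table.
import bisect

index_table = [("ABRAMS (M.)", 0), ("DEROSE (S.)", 100), ("JAAKKOLA (J.)", 200),
               ("MURTAGH (F.)", 300), ("Suciu (D.)", 400)]
keys = [symbolic for symbolic, _ in index_table]

def lookup(symbolic_key):
    """Binary search: return the index entry for the requested key (or the
    closest one, which gives the zone of the inverted file to scan)."""
    position = bisect.bisect_left(keys, symbolic_key)
    return index_table[min(position, len(index_table) - 1)]

density = len(index_table) / 500    # e.g. 5 index keys for 500 articles
print(lookup("MURTAGH (F.)"), density)   # ('MURTAGH (F.)', 300) 0.01
```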
The parameter files
These are files that contain:
• presentation parameters,
• technical or IT parameters,
• "linguistic and documentary" parameters.
Example of a search engine: AltaVista
The "AltaVista Search Engine 3.0" is available on the website http://solutions.altavista.com/
Minimum System Requirements
• An Intel Pentium 300 MHz or higher,
• 256 MB RAM,
• at least 5 GB of free disk space after installation.
Supported Platforms
• Windows 2000 or Windows NT 4.0 SP6,
• Solaris 2.6, 7 or 8,
• Red Hat Linux 6.0 , 6.1, and 6.2 ,
• AIX Version 4.3 ( or later )
• Compaq Tru64 Unix 4.0D (or newer ) .
Business users
Many companies use the AltaVista engine to provide the search facility on their own websites:
• Abebooks.com ,
• Amazon
• Buy.com ,
• and many others ...
General architecture of the engine
As described above, websites are explored by robots and their data is extracted (data sources). This data is stored in a database, then converted (converters, feeder) in order to be indexed (indexer), and finally stored in the index database (index).
The AltaVista engine comes with a very fast web server named "mhttpd", but it is still possible to use another one, such as Apache or IIS (Internet Information Server), for example.
This web server can perform searches against the index and send the responses back to the user.
Server architectures
All the components of the AltaVista engine can be deployed on different servers in order to optimize the performance of each component.
Solution: two servers
In this example, the information-retrieval part (machine A) is separated from the indexing and search part (machine B).
Solution: parallel servers
With the growth of Internet access, the number of searches can increase considerably, hence the need to add front-end systems to better distribute the load across the servers.
In this example, a "smart routing" program directs traffic to one server or the other (machine A or C) according to its load, in order to obtain better response times.
Here, the search servers, the index database and the database itself are redundant; the part that retrieves and creates data from the "Web" has not been duplicated.
Solution: independent servers
The tasks of retrieving data from the Internet can be spread across several servers.
It is even possible to mix several types of platforms and to concentrate maximum power where the system needs it most, depending on the type of load applied to the search engine.
Conclusion
Today, one thing is clear: the Internet is "exploding"; its growth is exponential and shows no sign of slowing down...
This is not without its problems:
the significant increase in the amount of data available on the network will, in time, lead to new needs in terms of information retrieval.
Alas, even if these tools have steadily improved in recent years, they still have a long way to go before reaching the performance offered by other types of engines... Indeed, obtaining (sufficiently) relevant answers sometimes requires the combined use of several search tools.
However, we can be optimistic, as technologies using artificial intelligence, which seem particularly promising (intelligent agents, natural language search), are beginning to appear.
Glossary
Database
A set of alphanumeric data stored on a single medium and interconnected in a coherent structure.
Internet
A vast collection of computer networks that exchange information using a suite of network protocols called TCP/IP.
Meta tag
A structure placed in the header of HTML web pages, providing information that is not visible in browsers but is used by search engines for indexing.
The most common meta tags (and the most useful for search engines) are KEYWORDS and DESCRIPTION.
The KEYWORDS meta tag allows the author to emphasize the importance of certain words and phrases used in the page. Some search engines take this information into account; others ignore it.
The DESCRIPTION meta tag allows the author to control the text displayed when the page appears in the results of a search. Some search engines may ignore this information.
Protocol
The definition of the rules governing how different computers on a network communicate with one another.
Spamming
Creating or modifying a document with the intent to deceive an electronic catalogue or filing system. Any technique that aims to improve a site's potential ranking at the expense of the quality of the search engine's database can also be considered spamming (also known as spamdexing or spoofing).
The term spamming is most commonly used to refer to the sending of unsolicited bulk e-mail; its use in the context of search engines derives from that meaning.
TCP/IP
TCP/IP is a pair of protocols, the Transmission Control Protocol (TCP) and the Internet Protocol (IP), that allow different types of computers to communicate with each other. The Internet is based on this protocol suite.
Web
More broadly, the Web refers to the entire network of sites that can be navigated with a browser.
The Internet in recent years has a very significant increase in the number of users , number of connected computers and the amount of information that is available. The very nature of this network has the outset , a potential tool for professionals in the Information- documentation.
The advantages of the Internet are indeed numerous :
• relatively cheap,
• Provides an interface for simple navigation and unified
• a means of rapid and effective communication,
• place of publication "dynamic" , that anyone can publish documents and update them easily and quickly.
The network is trying to establish itself as a complementary source of information conventional services. But it is not enough to be on top of a mountain of books to suddenly become more intelligent. It should be able to find relevant information on the network and exploit knowledge , or the nature of the Internet means that it is scattered , not centralized.
To find this information , many research tools were developed tools we will try to present an overview (non- exhaustive ) and specificities ( Pros / Cons ... ) specific to each of them.
Then, we will focus a bit more on the search engines to give you a detailed description , including as regards their various methods of operation.
Finally, we try to show that the optimal operation of such tools is truly possible that from the moment their architecture has been specifically designed for it.
Definition
The term " search engine " is often wrongly used to describe any research tools.
Indeed, from the time when there is a possibility of research, we tend to use the generic term "engine" , yet it is vital to differentiate each of these tools , their characteristics making difficult to compare.
Presentation of the different tools
Search engines
Definition
A search engine is a tool that allows the user to search all pages "web" with a word or expression .
It uses , in general , software that periodically scan a portion of the static Web files in order to update a database containing indexing words all or part of the files ( pages) visited.
For example: AltaVista stores 350 million ( file ) , Lycos 340 million , only 2 million WebCrawler , etc. .
Operation
When the user enters a keyword in the search form , the engine will search for it in its base , that is to say in the content of Web pages saved .
The query syntax is very important, so the user must learn to use key words and language specific to each search engine .
Once identified, the "lot" of pages containing the term requested is classified in order of relevance ( up / frequency search terms ) , an algorithm based on some sorting criteria and specific to this action will do .
Examples:
• AltaVista ( http://www.altavista.com )
• Excite ( http://www.excite.com )
• Google ( http://www.google.com )
• Lycos ( http://www.lycos.com )
• WebCrawler ( http://webcrawler.com )
The various categories
The general semantic search engines
The first engines were characterized by a full-text search function exclusively focused on unstructured data - that is to say those in the body of documents.
Thereafter , technological developments have extended reading structured data ( "metadata" , ie META tags in HTML) .
Many other file formats (such as Word documents) adopt the same type of process - with a different mode of treatment.
To further refine the search , said semantic engines then made their appearance , their main objective is to integrate the meaning of language in the research process . For this , they rely on dictionaries of concepts (or thesauri ) specializing in the treatment of specific themes .
This method allows to provide relevant answers on specialized areas for users who are not necessarily.
Main constraint: it is necessary to perform a semantic upstream work, and refine thereafter based on user feedback .
products:
• Hummingbird (EIP )
• Verity ( Portal One )
• Autonomy ( KnowledgeServer )
• Sinequa ( Intuition )
• LexiQuest ( LexiQuest ) .
engines multi- dimensional search
Basically, their operation is based on that OLAP cubes ( OnLine Analytical Processing) - used including data warehouses or data warehouses in decision-making systems.
This method requires a parameter driven ( case by case ) , refines the categorization of documents and processes cross applications .
This is probably the most advanced products.
Note: coupled with semantic search , this type of engine can become powerful.
products:
• Instranet ( Instranet 2000).
The vertical search engines
This is a relatively new category on the market.
It covers solutions for very specific business issues , such as the Web, human resources management or trades yesterday.
products:
• Arisem ( OpenPortal4U )
• Auracom ( Auraweb )
• Alogic ( Alcalimm )
• Atomz ( Atomz ) .
Advantages
• Relatively comprehensive in their field,
• Indexes each word in a Web page ,
• Propose search options sometimes useful .
Disadvantages
• Number of responses very often (too) high,
• Many documents lack of relevance to the query,
• Requires some experience of use.
The search directories
Definition
Directory (also called directory) is a research tool that identifies a number of sites through records ( record) Descriptive generally comprising : a title, address ( URL ) and a brief description of 15 25 words maximum (established by the publisher of the book body or made by the site editor ) .
This catalog uses a database (usually filled manually ) describing a selection of sites indexed using a tree list of topics (sometimes called categories or topics ) .
Operation
The interrogator may make its search using the words contained in the site name , classification, optionally, a description and / or moving in the topic tree .
When a keyword is entered, the phone will search its listings on occurrences of the term , not in the content of the pages (the main difference with the search engines) .
The output is a list of sites classified thematically according to the word or words found in the research subjects or elsewhere ( name, site description, etc . ) .
Advantages:
• The operation is relatively simple ,
• Must the most important sites are referenced ,
• The user is guided in his research : he became successively to the classification more accurate .
Disadvantages:
• Non- exhaustive , only a small part of the network is referenced.
• The rapid evolution of the Internet requires that the content of the items to be updated regularly (need editorial team ) ... or it is not always the case.
Examples:
• AOL Search ( http://www.recherche.aol.fr )
• It is found ( http://www.ctrouve.com )
• Francité ( http://francite.com )
• The yellow pages ( http://www.pagesjaunes.fr )
• MicroSoft Networks ( http://www.msn.com )
• Netscape Netcenter ( http://www.netscape.fr )
• Nomad ( http://www.nomade.fr )
• UREC ( http://www.urec.fr )
• Yahoo France ( http://www.yahoo.fr ) .
The meta-search engines
definition
To provide more opportunities for surfers , appeared meta- engines.
These " great drivers" allow from a single query , perform simultaneously search multiple search engines and directories sites and view the results (using different research instruments used ) in a single page.
The interviewer does not have to worry about the position of the search words (title, description etc ... ) .
Advantages
• Efficiency growing ,
• cumulative power of engines .
Disadvantages
• Length of the research,
• Fancy results ,
• Problem adaptation query requests (related to the heterogeneity of the syntax used by different motors)
• Efficiency face lesser search engines specialized in complex queries.
Note: The first two disadvantages are not really relevant.
Examples:
• Debriefing ( http://www.debriefing.com )
• MetaCrawler ( http://www.metacrawler.com )
• Savvy Search ( http://www.savvy.com )
• Search.com ( http://www.search.com )
• The Big Hub ( http://www.thebighub.com ) .
Portals
definition
A portal is a " Website" public somehow playing the role of gateway to the different Internet services .
There is a set of resources and services specific to general or a domain ( search tool , directory, free e-mail service ...) to a defined set of users ( public, member of a profession or industry ... ) . The idea is to offer the best possible services with the aim of making himself indispensable !
These services typically include one or more search tools on the Internet ( engines and / or directories) clean or not the portal , but also information regularly updated and, where appropriate, other types of commercial services or not.
Examples:
• AltaVista ( http://www.av.com )
• Yahoo! ( http://www.yahoo.com )
Services
• Research Tools ,
• Information services (news, finance , weather, etc. . )
• Communication tools (E -mail , mailing lists , newsgroups )
• Thematic Access to Information ,
• Tools consumption (online sales , advertising, etc. . )
• Customizable Services
• From content.
Advantages
• Navigation and search information on the "Web" facilitated
• Tree intuitive for the user ,
• Value added services .
Disadvantages
• Such sites can lead the users to a partition , they tend to be limited to its content ...
New Strategies
Public major portals have wide traffic, and allow companies who are advertising direct access to millions of consumers.
The cost of acquisition and retention of a client is so diminished.
These new sites thus represent enormous challenges .
Methods of financing a portal , for example:
• Banner advertising ,
• Partners
• Sell online.
specialized portals (" niches ") or professional
It may be thematic portals bringing together virtual communities around specific interests , or even regional portals ( or local) facilitating the identification of own resources in a region.
The identification of such services is usually a great help to detect relevant sites in relation to the research topic .
Examples:
• Dentist ( http://www.visioweb.com )
• Comics ( http://www.abcbd.com )
• Education and teaching ( http://www.education.fr )
• Culinary Art ( http://www.cuisinons.com )
rings ( rings , chains, virtual communities ... )
definition
Many sites on the Internet dealing with subjects closer . Often ways to address the issues are more complementary than competitive . Webmasters also often include links to sites similar to theirs. This forms a community, decentralized to the extreme , yet bound by a common theme.
A ring is a combination of Internet , which aims to promote and publicize the members of the community . It consists of a hub site identifying and describing all the investments , but also by a small logo that appears on the pages of members sites.
The ring is clearly distinguishable site link because membership is voluntary . It also differs from search engine users often drown in the amount of hits . The ring selects and controls the quality of the proposed sites .
Rings or engines?
When a visitor uses a search engine to obtain information , it must sort out the responses that are sent to consult with more or less likely the sites which appear to him to answer his query, eliminate those that hardly have anything or that no longer exist ... short, usually without too much hassle to get a comprehensive list on the topic searched .
The rings are one response to this concern. Being managed manually, they are subject to a selection of candidates and ensure quality and reliability of information. Of course you have to enter the ring corresponding to its research.
Two possibilities for this :
• By means of this ring on a site of which one obtains the address accurately or random,
• Either the site " WebRing " specialized in this type of information.
With or without " WebRing " ?
When sites want to form a group , they exchange their addresses. But this mechanism poses several problems, including updating of all sites in case of new membership and the lack of official recognition vis- à-vis the rest of the Net.
" WebRing " creates a structure referenced with tools.
There are tens of thousands of referenced in " WebRing " rings ( some of which include several hundred sites ) but very few French rings ( this is changing ) .
Thus, 10/02/98 , there was 45 830 active rings ( at least one access per day ) of several hundred to thousands of sites.
Webmaster and ringmestre ?
Initially, a webmaster would group sites dealing with the same topic: tourism, personal pages , sport, literature , geographical or cultural entity , etc ...
So he created a site at " WebRing " , give it a name (ring ID ) registered its website and it is identified in the ring by a number ( Site ID) .
In parallel, he created a logo and a table that allow navigation between the prospective member sites .
To become operational and lively ring , must join the other sites and are validated by the ringmestre , coordinator of the group. It belongs to ringmestre validate sites waiting in view of their content should follow the spirit of the project.
Example
• On the Lyon region ( http://nav.webring.yahoo.com/hub?ring=autour&list )
• On the snowboard England ( http://s.webring.com/hub?ring=ukboarders )
• On the Sting ( http://nav.webring.yahoo.com/hub?ring=fieldsofsting&list )
The " invisible web "
Definition
The " invisible web " (or " hidden Web ") is not strictly a research tool, it is, in fact, part of the Web containing all documents (texts, videos , pictures. ..) were not indexed by traditional search engines (engines, directories ... ) .
Indeed, it is relatively easy to find the address of a site using conventional search tools , however it is almost impossible to perform a query on the contents of its catalog ( databases , if it has , accessible only via their own search engine) .
Origins of the " non- indexing"
Why such documents were not indexed are many and varied :
• Indexing concerns only music files ( MP3 , midi ) , images (gif, jpg ... ) , and documents in HTML ( and soon those in XML format). Impossible to find a document ( word processor , spreadsheet ... ) , animation (Flash) , or PDF, PostScript file ....
• Lack of access to data due to their dynamic (dynamic pages , databases )
• The robot in charge of research has been deliberately restrained according to certain criteria (levels of width and depth , followed by external links, page size )
• Some documents were simply "banned SEO " by their authors or by using meta tags or using files (" Robots.txt " )
• The algorithm used by the engine did not consider as relevant information contained in the page , relative to the application,
• The site is too new or has not taken the step to make reference to be present in the database tools .
Research Tools
Currently , it is estimated that the size of the " invisible Web " is approximately 30-35% of total content ... fortunately, there is a large amount of tools to allow us to discover this new world :
• Adobe PDF Search ( http://searchpdf.adobe.com ) : more than a million documents are available at this address (all in PDF format !) .
• All- One- Search ( http://www.allonesearch.com/ ) : the oldest tools of research databases: effective, but difficult to understand.
• AlphaSearch ( http://www.calvin.edu/library/searreso/internet/as/ ) another tool well done, but not to search.
• National Library of France ( http://www.bnf.fr/web-bnf/liens/index.htm ) : collection of bookmarks commented .
• Electronic Access Journal ( http://www.coalliance.org/ejournal ) includes information resources of several American universities.
• fossick ( http://www.fossick.com ) : nearly 1,400 links to the best specialized search tools .
• InvisibleWeb ( http://www.invisibleweb.com ) service search more than 10,000 archival data bases , catalogs ... all this organized thematically .
• searchability ( http://www.searchability.com ) links to many sites about the " invisible web ".
• The Invisible Web Catalog ( http://www.invisibleweb.com ) : "the" place to start research in this area.
• Webcats ( http://www.lights.com/webcats/ ) offers a selection of library catalogs worldwide .
Other tools
There is such a diversity search tool today , it would be completely unrealistic to try to identify all .
Some examples of these categories hard " classifiable "
• Search classified by geographical area addresses
FinderSeeker ( http://www.finderseeker.com )
Excite Travel ( http://www.excite.com/travel )
The limitations of these tools
Use one of the services mentioned above is not sufficient .
Indeed, each tool has its strong points , although sometimes they are developed on a common technology.
No tool is perfect , always use several in parallel.
For example , a first step would be to use a directory like Yahoo or Looksmart , then, depending on the results, it would be enough to move to engines or meta- engines.
Technology research tools
General Operation
The objective of research tools is to identify the sites, browse and store their content (usually in the form of index) to facilitate access to their pages.
These tools based almost all on the same architecture composed of three separate modules: software robot explorer (" spider ") , an indexing system , and search software (" searcher ") :
Each module has a clearly defined role :
• the "spider" surfs like a browser, retrieving and analyzing the maximum information (URL , title, relevant words ... etc. . ) From pages they visit.
• the indexing system is then responsible for storing and classifying this information appropriately in a database.
• the " searcher " , meanwhile , is responsible to find in this basis, the documents most relevant to the query that will be submitted.
The exploration robot software ("spider")
The spider (also called a crawler or bot) is a software robot, a kind of "snoop" that autonomously explores the meanders of the Web.
Its role is crucial, because part of the engine's power depends on it.
It must be configured with care, because it is not uncommon for competing engines to use the same spider for their exploration... it is the finesse of their respective settings (e.g. selective versus massive exploration of sites) that makes the difference in their effectiveness.
These software "explorers" use algorithms to periodically revisit millions of pages (taking care not to loop endlessly) and thus build up a database of previously visited sites and documents.
Some spiders ingest the full page, while others take only the title, the summary prepared by the author of the page, and the related parts called meta tags.
These robots then identify the hypertext links on the page (checking that they have not already been visited) in order to reach the pages they point to. They thus rapidly traverse an entire site, then the sites linked to it... and so on.
This traversal can be done breadth-first, depth-first, or in a mixed fashion (breadth-first up to a certain point, then depth-first).
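To make the traversal concrete, here is a minimal breadth-first "spider" sketch in Python. The fetch_page and extract_links helpers are assumptions (placeholders for a real HTTP fetcher and link extractor), not part of any actual engine; replacing the queue with a stack would give a depth-first traversal.

```python
from collections import deque
from urllib.parse import urljoin

def crawl_breadth_first(seed_url, fetch_page, extract_links, max_pages=1000):
    """Breadth-first exploration sketch; fetch_page and extract_links are
    hypothetical helpers supplied by the caller."""
    visited = set()               # guards against looping on already-seen pages
    frontier = deque([seed_url])  # FIFO queue = breadth-first
    collected = {}                # url -> raw HTML handed over to the indexing system

    while frontier and len(collected) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)
        if html is None:
            continue
        collected[url] = html
        for link in extract_links(html):
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in visited:
                frontier.append(absolute)
    return collected
```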
Depending on how often its robots rotate across the network, a research tool therefore maintains a huge database of URLs, allowing it to give a good approximation of the whole Web at any given time.
But beware: although spiders can "swallow" nearly ten million pages per day, only the most visited sites are regularly updated.
Note: the Internet code of conduct (netiquette) defines instructions that allow site administrators to specify, through a file called "robots.txt", the areas where robots may do their work and the private areas they should not catalog.
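As an illustration, the sketch below uses Python's standard urllib.robotparser module to honor such a file before fetching a page; the URL is purely illustrative.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# can_fetch() tells a given robot whether an area is open to it
if rp.can_fetch("MySpider", "http://www.example.com/private/report.html"):
    print("This page may be crawled")
else:
    print("This area is off limits to robots")
```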
The indexing system
The spider then passes the information it has collected to the indexing engine, where it is analyzed. The engine builds an index of the words encountered (together with the addresses of the relevant pages) and stores everything in a database; this is generally referred to as automatic indexing.
Note: the complete renewal of the index (the "refresh time") may take anywhere between one week and one month, depending on the engine.
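A minimal sketch of this automatic indexing step, assuming a pages dictionary that maps each URL to the plain text already extracted from it:

```python
import re
from collections import defaultdict

def build_index(pages):
    """Build a word -> set-of-URLs index from {url: plain_text}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

# A later "refresh" simply rebuilds (or merges) the index from a new crawl.
```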
Data extraction
Electronic documents are rarely plain, unbroken text; they are more often office documents, marked-up documents, or more generally rich text (typography, presentation, layout, etc.).
Because of their particular structure (specific to the tool that created them), these files can only be indexed after conversion: the indexing engine accepts only data in the form of continuous text.
Tools for extracting the text before indexing are therefore needed.
Some indexing engines use these techniques to index documents in their "native" formats, without duplicating them, while others prefer to transcode them into a simplified format and then work only on the "plain" text.
Indexing techniques
At first, only the titles of documents were indexed, but this solution quickly showed its limits: first, because the title of a document does not always reflect its content, and second, because it generates many redundancies.
The solution then adopted was to store, in addition to the title, all the words of the first paragraph.
Today, this method has in turn been abandoned in favor of metadata (meta tags), which allow the author to specify the words under which he wants the content of his pages to be indexed.
But these tags are still too uncommon (indexing and cataloging has become a full-time job...) or too poorly used (many cheat by overloading them...) to be exploited effectively by robots. This idea nevertheless remains a solution for the future.
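As an illustration, here is a small sketch that reads the KEYWORDS and DESCRIPTION meta tags of a page using Python's standard html.parser module; the HTML sample is invented for the example.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect the content of the KEYWORDS and DESCRIPTION meta tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content", "")

html = """<html><head>
<meta name="keywords" content="search engine, indexing, spider">
<meta name="description" content="An overview of Web research tools.">
</head><body>...</body></html>"""

reader = MetaTagReader()
reader.feed(html)
print(reader.meta)
# {'keywords': 'search engine, indexing, spider', 'description': 'An overview of Web research tools.'}
```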
Note: AltaVista, for example, limits its indexing to the first 1,024 characters of the source (1 KB, each character being coded on one byte, i.e. eight bits).
Examples of indexing software:
• AltaVista,
• Excite,
• Fulcrum,
• Infoseek,
• IntelliServ (Verity),
• PLsearch,
• Livelink (Opentext),
• RetrievalWare (Excalibur).
The search module ("searcher")
Appearance and operation
The searcher is the visible part of the iceberg: the user-facing front end.
A showcase for the site, this home page is updated regularly and is usually decorated with contextual ads (which vary depending on the wording of the query).
Through this interface, the visitor can type a question, select the available options, and click a button (or press the Enter key) to launch the search. A CGI (Common Gateway Interface) script then calls the indexing system so that it runs a query against the database containing the data collected on the Web.
The different types of query:
• Boolean queries: these use logical operators (AND, OR, NOT, NEAR, etc.). Engines and directories do not all offer the same set, and the query syntax also varies from one to another; a good knowledge of these tools is therefore necessary to conduct relevant research.
• Search through a list of words: the user's query is transcribed by the system into a Boolean expression according to a pre-established scheme (implicit AND, implicit OR, etc.), as in the sketch following this list.
• Natural language search: this type of search requires "intelligent" indexing and retrieval based on sophisticated language-processing modules. No system available on the Internet (yet) has such processing power.
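A minimal sketch of the implicit-AND interpretation, reusing the word-to-pages index sketched earlier (an assumption for illustration, not the interface of any real engine):

```python
def search_implicit_and(index, query):
    """Each query word must appear in the page: implicit AND = set intersection."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# With an implicit OR, the intersection would simply become a union (|=).
```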
Basic search techniques
Classical documentary search
Every information search tool works from index files, the heart of the technology used, which represent the textual information stored in electronic documents.
Classical documentary search uses keyword-type indexes, where the index entries are standardized words or phrases drawn from a controlled vocabulary (a thesaurus, for example).
This is a proven technique that reliably retrieves information indexed in this way. The search is exact: a Boolean search equation is matched against the content of the index files, so it is perfectly effective and leaves no uncertainty, in computing terms, as to whether the documents found correspond to the search equation.
However, it requires significant resources to index the documents, and the quality of the search results depends directly on the quality of the indexing.
When the electronic documents are unstructured textual information, the index entries can be far more numerous and little or not at all standardized. To be effective, the search cannot be limited to a simple analysis of the presence or absence of words, at the risk of producing excessive rates of silence and noise. It is precisely on these issues that technologies have evolved significantly in recent years.
Text search
Unlike retrieval in a structured database, text search aims to find the documents that "resemble" the question as closely as possible. Because language is ambiguous, it is impossible to obtain all the relevant documents and only those. The goal is therefore to optimize the search results in terms of both noise and silence, that is, to analyze the relevance of each document with respect to the question. And that is where the difficulty lies, especially since it is not enough to examine a single database of information: many are scattered over the network, and it also becomes necessary to limit the answers returned to the most relevant only.
The techniques used are varied, the main ones being linguistic techniques, statistical techniques and fuzzy search. They can be used alone or in combination.
Linguistic techniques
They can operate at two levels: indexing and searching.
• At indexing time, the goal is to reduce the number of index entries by putting words and phrases into a "canonical" form: a word and its inflected forms, or a verb and its conjugations, are reduced to a single entry, as are idiomatic expressions, thanks to a morphological and syntactic analysis of the text. For this, a dictionary of the language is used, possibly customized to the field covered by the document corpus.
• At search time, the question expressed in natural language is analyzed to extract the most significant words, disambiguate the question through an interactive dialogue with the user, and generate a normalized query for the search engine.
Electronic dictionaries are not simple lists of words or expressions, but true semantic networks that link expressions through specific relations and include terminological definitions.
They are also used in relevance analysis, to weight the results according to the semantic proximity between the documents and the terms of the question.
The software packages available today include engines that either build linguistic functions for indexing and searching into their base product, or rely on third-party linguistic tools, mostly for the search stage only.
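A very simplified sketch of this normalization step is given below; the canonical-form dictionary and stopword list are toy assumptions standing in for a real morphological dictionary and syntactic analyzer.

```python
# Toy dictionary mapping inflected forms to a canonical entry.
CANONICAL = {
    "pages": "page",
    "engines": "engine",
    "indexed": "index",
    "indexing": "index",
}

STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "how", "by"}

def normalize(word):
    return CANONICAL.get(word.lower(), word.lower())

def normalize_query(question):
    # keep only the significant words and reduce them to canonical form
    return [normalize(w) for w in question.lower().split() if w not in STOPWORDS]

print(normalize_query("How are pages indexed by search engines"))
# ['page', 'index', 'search', 'engine']
```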
Statistical techniques
Three broad categories of statistical techniques are available today, each with a different purpose.
• The first aims to order a search result in decreasing order of relevance, and therefore to compute the weight of a document with respect to the question. For this, criteria such as the frequency of words in the document, the discriminating power of the query words (which depends on how rare they are in the index), the distance between the query words as they appear in the document, and the density of query keywords in the document are taken into account. Obviously, if the indexing and search engine uses linguistic techniques, these statistical techniques can be applied not to raw words but to index entries (standardized concepts), yielding an even better relevance ranking. A minimal weighting sketch is given after this list.
• A second category of statistical techniques essentially aims to assist the user in his search. For example, when the user writes a Boolean query, he can indicate whether he wants a strict search or would accept an approximate result. The search mechanism then analyzes the intermediate result sets, for each criterion, and, mainly on the basis of word-frequency criteria, may broaden the strict result set. Similarly, similarity search consists of building a query that retrieves texts as close as possible to a portion of text selected by the user. Here again, it is through a weighting of the search criteria and a statistical analysis that the results are sorted and presented to the user in ordered form.
• The third technique, called automatic classification, aims to group the search results into homogeneous classes and to identify, for each class, the words or phrases that best describe it. The user can thus quickly identify the classes that interest him and set the others aside. This is a statistical technique widely used in other fields (economic or demographic surveys, for example). By refining the question step by step, the user can home in on the documents closest to his expectations.
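The sketch below illustrates the first category with a simple weighting in the TF-IDF spirit (word frequency in the document combined with the rarity, and thus the discriminating power, of the word in the whole collection); it shows the principle only, not the algorithm of any particular engine.

```python
import math
from collections import Counter

def rank(documents, query_words):
    """documents: {doc_id: list of normalized words}; returns (doc_id, score) pairs."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    scores = {}
    for doc_id, words in documents.items():
        tf = Counter(words)
        score = 0.0
        for w in query_words:
            if w in tf:
                idf = math.log(n_docs / doc_freq[w])      # rarity = discriminating power
                score += (tf[w] / len(words)) * idf       # frequency in the document
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```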
Fuzzy search
Fuzzy search aims to allow text retrieval that tolerates errors, either in the formulation of the query or in the content of the texts. It is typically used to search texts produced by optical character recognition (OCR), whose quirks are well known. Indeed, OCR generates two types of errors: unrecognized characters and misrecognized characters. In both cases, words are "scrambled" by spurious or approximate characters; for example, the OCR may recognize the term "irninente" instead of "imminente".
The methods used are numerous, and are based either on algorithmic techniques or on neural networks.
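Here is a sketch of the algorithmic approach, based on the classic Levenshtein edit distance; the small vocabulary is invented for the example.

```python
def edit_distance(a, b):
    """Dynamic-programming Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_match(word, vocabulary, tolerance=2):
    """Return the vocabulary words within the given edit-distance tolerance."""
    return [v for v in vocabulary if edit_distance(word, v) <= tolerance]

print(fuzzy_match("irninente", ["imminente", "immense", "eminente"]))
# ['imminente']
```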
One or more techniques?
The different search techniques, or more precisely relevance-improvement techniques, are not competing but complementary, and the technological evolution of the engines on the market, together with the number of active partnerships, shows that while the problem is not simple, the solution probably lies in a combination of the available techniques.
One particularly difficult issue is the management of multilingualism, or at least of databases containing information in multiple languages. While linguistic techniques (and parsing rules) obviously depend on the language, this is not the case for the other techniques (statistical methods and fuzzy search). Indeed, even if the weighting and calculation principles may differ slightly from one language to another, the same criteria apply everywhere. This is probably one of the reasons why these techniques currently have the wind in their sails.
Presentation of results
Results are displayed as a dynamically generated web page, generally presenting the answers as a list (more or less detailed), sometimes accompanied by a result count, ranked in order of relevance to the query.
If the basic layout of the document is to be respected, the indexing stage must record the position of each word in the original electronic document, so that the engine can highlight it when displaying the results.
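A minimal highlighting sketch (using ** markers where a real engine would insert HTML emphasis tags):

```python
import re

def highlight(text, query_words):
    """Wrap every occurrence of a query word in ** markers."""
    pattern = re.compile("|".join(re.escape(w) for w in query_words), re.IGNORECASE)
    return pattern.sub(lambda match: "**" + match.group(0) + "**", text)

print(highlight("The spider feeds the indexing system.", ["spider", "indexing"]))
# The **spider** feeds the **indexing** system.
```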
Architecture of research tools
It is very difficult to find documentation on the internal architecture of search engines, each vendor wanting to protect its technological secrets.
That is why this part focuses mainly on examples of architectures from existing engines (whose publishers have kindly disclosed some of their characteristics).
General Structure
Search engines rely on one or more databases in which are indexed the data previously retrieved by their exploration tools as they crawl the Web.
These databases can be stored on one or more disks (or on one or more servers) in order to parallelize the processing applied to them.
Crawling and indexing can also be particularly greedy in terms of CPU resources, so their processing can, in turn, be parallelized, for example by using servers grouped in a cluster.
Example: the DILIB search engine
DILIB (available at http://www.loria.fr/projets/DILIB/dilib-0.2/DOC/index.html) is a platform for document engineering and scientific and technical information, containing among other things a search engine whose architecture is similar to those found on the Web.
Structure of the database
The effectiveness of a search engine depends largely on the speed with which it can retrieve information from its content base.
That is why virtually all engines are organized around inverted files.
Such files contain as many records as there are distinct descriptors, and thus make it possible to find the list of primary documents (those containing a given descriptor) from a single record.
From a file of bibliographic records (a simple sequence of records), the generation of an Information Retrieval System (IRS) produces:
• a direct file (the bibliography...) with a direct-access mechanism,
• inverted files: one for each indexed field,
• various files: parameters, stopwords, etc.
In the case of Web search engines, the records represent all the data collected by the robot during its exploration of the Net.
Organization of hierarchical data
DILIB builds its inverted files on HFD structures (Hierarchical File organization for Documentation):
Principle
Key <-> Unix location correspondence
Each record is assigned a 6-digit key (000000 to 999999), which is then divided into three 2-digit fields: "dd ff rr".
Records are grouped into groups of 100.
Each group is stored in a Unix file whose name is of the form "ff.df".
Example:
Record 123456 will be the 56th record in a file named 34.df.
The ".df" files are in turn grouped in sets of 100.
Each group of files is stored in a Unix directory whose name is of the form "dd.dd".
Example:
Record 123456 will be the 56th record in the file named 34.df, itself stored in the directory 12.dd.
These directories are finally gathered in a root directory whose name ends in ".hfd".
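A small sketch of this addressing scheme; the root directory name corpus.hfd is an assumption made for the example.

```python
import os

def hfd_location(key, root="corpus.hfd"):
    """Map a 6-digit key 'ddffrr' to (directory/file path, position in file)."""
    key = "%06d" % int(key)
    dd, ff, rr = key[0:2], key[2:4], key[4:6]
    directory = dd + ".dd"   # group of 100 files
    filename = ff + ".df"    # group of 100 records
    position = int(rr)       # rank of the record inside the file
    return os.path.join(root, directory, filename), position

print(hfd_location(123456))
# ('corpus.hfd/12.dd/34.df', 56)
```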
Generalization
In fact, HFD structures can manage larger collections by playing on:
• the number of index levels,
• the nature of the keys (numeric, alphanumeric),
• the length of the keys.
Inverted files
Principle
This mechanism improves the performance of the search engine.
The inverted files are created from data in the direct file, e.g. keywords or information about the record (authors, title, etc.).
An inverted file contains the list of values, arranged in order, of a given field (keywords, author, etc.), and thus allows rapid access to the corresponding records in the direct file (one inverted file is created for each field to be classified).
Example of a direct file:
Record no. | Author(s) | Title | Keywords
000001 | Herge | Tintin in the Congo | Tintin, Snowy, dog
000002 | Herge | Tintin in America | Tintin, Snowy, horse, dog
000003 | Morris, Goscinny | The Daltons | Lucky Luke, horse
000004 | Goscinny, Uderzo | Asterix the Gaul | Asterix, Idefix, dog
Example of an inverted file (keywords):
Asterix: 000004
Horse: 000002, 000003
Dog: 000001, 000002, 000004
Idefix: 000004
Lucky Luke: 000003
Snowy: 000001, 000002
Tintin: 000001, 000002
Example of an inverted file (authors):
Goscinny: 000003, 000004
Herge: 000001, 000002
Morris: 000003
Uderzo: 000004
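The sketch below shows how such inverted files can be generated from the direct file; the records reproduce the example above.

```python
from collections import defaultdict

direct_file = {
    "000001": {"author": ["Herge"], "keywords": ["Tintin", "Snowy", "dog"]},
    "000002": {"author": ["Herge"], "keywords": ["Tintin", "Snowy", "horse", "dog"]},
    "000003": {"author": ["Morris", "Goscinny"], "keywords": ["Lucky Luke", "horse"]},
    "000004": {"author": ["Goscinny", "Uderzo"], "keywords": ["Asterix", "Idefix", "dog"]},
}

def invert(direct, field):
    """Build one inverted file: field value -> ordered list of record numbers."""
    inverted = defaultdict(list)
    for record_no, record in sorted(direct.items()):
        for value in record[field]:
            inverted[value].append(record_no)
    return dict(sorted(inverted.items()))

print(invert(direct_file, "keywords")["dog"])     # ['000001', '000002', '000004']
print(invert(direct_file, "author")["Goscinny"])  # ['000003', '000004']
```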
The index tables
Principle
This arrangement also improves engine performance.
Index tables are used to locate a given keyword quickly within an inverted file.
An example: the index table for the authors of an XML server:
Symbolic key | Numeric key
ABRAMS (M.) | 000000
DEROSE (S.) | 000100
JAAKKOLA (J.) | 000200
MURTAGH (F.) | 000300
SUCIU (D.) | 000400
Index tables are used to associate with a given item key the relative address of that item; the index may or may not be sorted.
The density of the index is the ratio between the number of index keys and the number of items.
The parameter files
These are files that contain:
• presentation parameters,
• computing or technical parameters,
• linguistic and documentary parameters.
Example of a search engine: AltaVista
The "AltaVista Search Engine 3.0" is available from the website http://solutions.altavista.com/.
Minimum System Requirements
• An Intel Pentium 300 MHz processor or higher,
• 256 MB RAM,
• at least 5 GB of free disk space after installation.
Supported Platforms
• Windows 2000 or Windows NT 4.0 SP6,
• Solaris 2.6, 7 or 8,
• Red Hat Linux 6.0, 6.1 or 6.2,
• AIX Version 4.3 (or later),
• Compaq Tru64 Unix 4.0D (or newer).
Business users
Many companies use the AltaVista engine to provide search on their websites, among them:
• Abebooks.com ,
• Amazon
• Buy.com ,
• and many others ...
General architecture of the engine
As described above, websites are crawled by robots and data are extracted from them (data sources). These data are stored in a database, then converted (converters, feeder), indexed, and finally stored in the index base (index).
The AltaVista engine ships with a very fast web server named "mhttpd", but it is still possible to use another one, such as Apache or IIS (Internet Information Server).
This web server runs searches against the index and sends the responses back to the user.
Server architecture
All the components of the AltaVista engine can be deployed on different servers in order to optimize the performance of each of them.
Solution: 2 servers
In this example, the data-gathering part (machine A) is separated from the indexing and search part (machine B).
Solution: Servers in parallel
As Internet access grows, the number of searches can increase sharply, hence the need to add front-end systems to better distribute the load across the servers.
In this example, a "smart routing" program directs traffic to one server or the other (machine A or C) according to its load, in order to obtain better response times.
Here, the search servers and the index base itself are redundant; the part that gathers and creates data from the Web has not been duplicated.
Solution: independent servers
The data-gathering tasks over the Internet can be distributed across multiple servers.
It is even possible to mix several types of platform and to concentrate the most power where the system needs it most, depending on the type of load placed on the search service.
Conclusion
Today, one thing is clear: the Internet is "exploding"; its growth is exponential and shows no sign of slowing down...
This is not without its problems:
the significant increase in the amount of data available on the network will eventually lead to new needs in terms of information retrieval.
Alas, even though these tools have improved steadily over the last few years, they still have a long way to go before reaching the performance offered by other types of engines... Indeed, obtaining (sufficiently) relevant answers sometimes requires the combined use of several research tools.
However, we can be optimistic, as technologies based on artificial intelligence are beginning to emerge and seem particularly promising (intelligent agents, natural language search).
Glossary
Database
A set of alphanumeric data stored on the same medium and interconnected in a coherent structure.
Internet
A vast collection of computer networks that exchange information using the suite of network protocols called TCP/IP.
Meta tag
A structure placed in the header of HTML web pages, providing information that is not visible in browsers but is used by search engines for indexing.
The most common meta tags (and the most useful for search engines) are KEYWORDS and DESCRIPTION.
The KEYWORDS meta tag allows the author to emphasize the importance of certain words and phrases used in the page. Some search engines take this information into account; others ignore it.
The DESCRIPTION meta tag allows the author to control the text displayed when the page appears in search results. Some search engines may ignore this information.
Protocol
The definition of the rules governing the dialogue between different computers on a network.
Spamming
Creating or modifying a document with the intent to deceive an electronic catalog or filing system. Any technique aimed at artificially raising a site's ranking at the expense of the quality of the search engine's database can also be regarded as spamming (also known as spamdexing or spoofing).
The word spamming is most commonly used to refer to the sending of unsolicited bulk email; its use in the search engine world derives from that meaning.
TCP/IP
TCP/IP is a set of protocols, the Transmission Control Protocol (TCP) and the Internet Protocol (IP), that allow different types of computers to communicate with each other. The Internet is based on this protocol suite.
Web
The Web refers, more broadly, to the entire network of sites that can be navigated with a browser.