Using Google to retrieve terminological information

 

Google hacks
Tara Calishain, Rael Dornfest

Helpful information about Google syntax is also available on www.googleguide.com 

1) Introduction : How Google works.

PageRank Explained (http://www.google.com/technology/)

Google is continuously traversing the web in real time with software programs called crawlers, or “Googlebots”. A crawler visits a page, copies the content and follows the links from that page to the pages linked to it, repeating this process over and over until it has crawled billions of pages on the web.

PageRank uses the web’s vast link structure as an indicator of an individual page's value.

In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote.

Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search.

Google combines PageRank with text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.
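The "vote" idea can be sketched as a small power iteration. This is only the simplified textbook formulation of PageRank, not Google's production ranking; the four-page web, the damping factor of 0.85 and the function name `pagerank` are assumptions made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to (its "votes").
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current importance among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its vote evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web: A, B and C all link to D; D links back to A.
toy_web = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy web, D collects three votes and comes out on top, and A outranks B and C because its single vote comes from the "important" page D.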

 

2) A few basic rules

AND is automatically implied between terms. All your terms must appear somewhere:

in the text of pages

in pages that link to a result page

in other pages on the same site

 

Google's ranking favors pages with your words:

in phrases

close together

in the order typed

 

Google stems some words (it finds words with various endings)

The search

flat rent

will return results in which the word "rental" is used.

 

You can turn off the stemming function by using + as in flat +rent

Words like rental or rented may be present in the pages that are returned, but they will not be in bold characters. 

Google ignores common or "stop" words. The search

a flat to rent in the Midlands

will return results in which the words flat, rent and Midlands are in bold characters
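The effect of stop words can be imitated with a toy filter. The stop list below is a small hypothetical one chosen for this example; Google's actual list is not given in this handout.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "in", "of", "and"}


def significant_terms(query):
    """Return the query words that are not in the assumed stop list."""
    return [word for word in query.split() if word.lower() not in STOP_WORDS]


print(significant_terms("a flat to rent in the Midlands"))
```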

 

3) Punctuation 

A few punctuation signs are not ignored

- Apostrophe ( ' ): people's and peoples are searched as different words, but peoples and peoples' give the same number of results.

people's China

peoples' China

- Hyphen ( - ): same-sex retrieves same-sex, same sex and samesex.

Consequently, you should always supply the hyphen when searching for any word that might be spelled with a hyphen.

- Accent marks in Roman-alphabet foreign languages: elephant returns results for both elephant and éléphant, while +éléphant only returns results with the French spelling.

 

The query "samesex marriage" Burma yields few results, and Google suggests another spelling. Compare the number of hits for the following queries:

"samesex marriage" Burma

"same sex marriage" Burma

"same-sex marriage" Burma

 

4) Number range

Follow search terms with beginning and ending numbers, separated by two periods.

This syntax can also be used one-sided, to mean "less than" or "greater than" a given number.

To find pages mentioning Babe Ruth and dates between 1921 and 1935, type:

"babe ruth" 1921..1935

To find pages mentioning Picasso and dates until 1905, type:

Picasso ..1905
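The range syntax is easy to assemble by program. A minimal sketch: the helper names `number_range` and `range_query` are invented for this example, and the code only builds the query string; it does not contact Google.

```python
def number_range(low=None, high=None):
    """Build a number range such as 1921..1935, ..1905 or 1990.."""
    if low is None and high is None:
        raise ValueError("give at least one bound")
    left = "" if low is None else str(low)
    right = "" if high is None else str(high)
    return "{}..{}".format(left, right)


def range_query(terms, low=None, high=None):
    """Follow the search terms with the number range, separated by a space."""
    return "{} {}".format(terms, number_range(low, high))


print(range_query('"babe ruth"', 1921, 1935))
print(range_query("Picasso", high=1905))
```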

 

5) Use of Boolean operators and quotes

 

lung cancer

cancer lung

"lung cancer"

AND is not necessary if OR is not used:

"lung cancer" Sweden

 

NB : For high numbers, Google counts may be inflated by two-thirds.

http://aixtal.blogspot.com/2005/02/web-le-mystre-des-pages-manquantes-de.html

 

Boolean operators sometimes give surprising results. Compare the results for:

"lung cancer" Denmark OR Sweden

"lung cancer" Sweden OR Denmark

"lung cancer" Sweden Denmark

"lung cancer" AND (Sweden OR Denmark)

 

6) Other characteristics

Google is not case-sensitive but some search engines are.

No query can be longer than ten words (further words will be disregarded)

Google doesn't support truncation, that is, a wildcard standing for part of a word (some search engines do), but inside a quoted phrase the * stands for one or more whole words:

"coronary * disease"

returns instances of both Coronary artery disease and Coronary Heart Disease
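What the asterisk does can be imitated locally with a regular expression. This is a rough sketch only: it assumes that each * stands for one or more whole words, and it says nothing about how Google implements the feature. The function name `phrase_to_regex` and the sample texts are made up for this example.

```python
import re


def phrase_to_regex(phrase):
    """Turn a phrase with * placeholders into a case-insensitive regex."""
    pieces = []
    for word in phrase.split():
        if word == "*":
            # One or more whole words, as few as possible.
            pieces.append(r"\S+(?:\s+\S+)*?")
        else:
            pieces.append(re.escape(word))
    return re.compile(r"\b" + r"\s+".join(pieces) + r"\b", re.IGNORECASE)


pattern = phrase_to_regex("coronary * disease")
samples = [
    "Coronary artery disease is common.",
    "Risk factors for Coronary Heart Disease",
    "coronary disease",
]
matches = [text for text in samples if pattern.search(text)]
print(matches)
```

The third sample is not matched because nothing stands between the two words, so the placeholder has no word to replace.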

 

7) Specific syntax

Careful! These operators are case-sensitive: type them in lowercase.

 

Searching for a document whose title contains a specific character string

intitle:"lung cancer"

Be careful not to leave a space between the colon and the quotes

intitle:"lung cancer" Sweden

gives us all the pages whose title contains lung cancer and whose body (text) contains Sweden.

For a title that contains both, we need allintitle:

allintitle:"lung cancer" Sweden

Several words or expressions may be combined:

allintitle:"lung cancer" Sweden radon smokers

NB : allintitle is not easy to combine with other operators.

 

Searching for a document whose URL contains a specific character string

inurl:cdc "disease prevention"

returns pages about disease prevention on the CDC (Centers for Disease Control and Prevention) website.

allinurl:www.epa.gov ma

gives us pages about EPA (Environmental Protection Agency) programs in the state of Massachusetts (MA).

"allinurl:" works only on character strings, not URL components. In particular, it ignores punctuation. 

Thus,

allinurl: edu/gov

restricts the results to pages with the character strings "edu" and "gov" in the URL, but does not require that they be separated by a slash within that URL, that they be adjacent, or that they be in that particular word order.
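The behaviour described for "allinurl:" can be approximated in a few lines. This is an emulation under the assumptions stated above (punctuation is ignored, each remaining string must occur somewhere in the URL); it is not Google's implementation, and the example.edu addresses are invented.

```python
import re


def allinurl_matches(query, url):
    """Check that every character string of the query occurs in the URL.

    Punctuation and spaces in the query only separate the strings;
    their order and adjacency in the URL are not checked.
    """
    tokens = [t for t in re.split(r"[^0-9a-z]+", query.lower()) if t]
    url = url.lower()
    return all(token in url for token in tokens)


# Hypothetical URLs for illustration.
print(allinurl_matches("edu/gov", "http://www.example.edu/gov/index.html"))
print(allinurl_matches("edu/gov", "http://gov.example.edu/"))
print(allinurl_matches("edu/gov", "http://www.example.edu/"))
```

The second URL still matches although "gov" comes before "edu" and no slash separates them, which is exactly the looseness described above.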

"sickle cell disease" inurl:gloss

mucoviscidose inurl:anglais

"tissue bundles" inurl:proz

 

<Intext> ignores links, URLs and titles.

intext:html

intext:"yahoo.com"

Starting a query with the term "allintext:" restricts the results to those with all of the query words in only the body text, ignoring link, URL, and title matches.

 

Searching for a document whose anchors contain a specific character string

inanchor:"Multiple Sclerosis"

An anchor is the highlighted, clickable text of a hypertext link.

Because of the way that Google's algorithm used to work, a page was ranked higher if the sites that linked to that page used consistent anchor text. The ranking system was altered in 2007.

Try "miserable failure" and read the BBC page about Google bombing.

See also : http://en.wikipedia.org/wiki/Google_bomb and

http://searchengineland.com/bush-fix-your-miserable-failure-legacy-16036

 

<Site> restricts your search to a domain

"Multiple Sclerosis" site:edu

"Multiple Sclerosis" site:med.utah.edu

perl site:edu site:com

 

returns no results. The following syntax returns results from either edu or com domains:

 

perl (site:edu | site:com)
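Two juxtaposed site: restrictions are combined with the implied AND, which is why the first query finds nothing. The grouped form can be generated as follows; `site_alternatives` is a hypothetical helper name, and the code only builds the query string.

```python
def site_alternatives(terms, domains):
    """Restrict the search terms to any one of several domains."""
    if not domains:
        raise ValueError("give at least one domain")
    if len(domains) == 1:
        return "{} site:{}".format(terms, domains[0])
    # The | sign means OR; the parentheses keep the alternatives together.
    group = " | ".join("site:" + domain for domain in domains)
    return "{} ({})".format(terms, group)


print(site_alternatives("perl", ["edu", "com"]))
print(site_alternatives('"Multiple Sclerosis"', ["med.utah.edu"]))
```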

 

<Link> finds all the pages with a specified link (useful for web site administrators)

 

You can dispense with http://

link:http://www.univ-lyon2.fr/

returns the n pages that link to http://www.univ-lyon2.fr/

 

No other query terms can be specified when using this special query term.

Starting a query with the term "allinlinks:" restricts the results to those with all of the query words in the URL links on the page.

The combination inurl + site makes it possible to find the number of subdirectories in a given folder :

site:nytimes.com inurl:politics

returns all the URLs for the politics sub-folder.

 

site:univ-lyon1.fr inurl:polycop

returns all the URLs with online material for medical students

 

The query prefix "filetype:" filters the results returned to include only documents with the extension specified immediately after (no space after the colon).

“Addison’s disease” filetype:doc OR filetype:pdf

The query prefix "-filetype:" filters the results to exclude documents with the extension specified immediately after.

cholecystectomy -filetype:doc -filetype:pdf
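Both uses of the prefix can be produced by one small helper. A sketch only: the name `filetype_query` and its parameters are assumptions made for this example.

```python
def filetype_query(terms, include=(), exclude=()):
    """Add filetype: filters (joined by OR) and -filetype: exclusions."""
    parts = [terms]
    if include:
        parts.append(" OR ".join("filetype:" + ext for ext in include))
    # Each exclusion is a separate term, with no space after the colon.
    parts.extend("-filetype:" + ext for ext in exclude)
    return " ".join(parts)


print(filetype_query('"Addison\'s disease"', include=["doc", "pdf"]))
print(filetype_query("cholecystectomy", exclude=["doc", "pdf"]))
```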

 

http://scholar.google.com/  specifically targets scientific sites.

 

 

8) Using the wildcard character (*) to replace one or several words

 

Application : looking for translations for the term gluten-sensitive enteropathy

 

We may safely assume that the words entéropathie and gluten are part of the French expression that refers to the same notion.

"entéropathie * gluten"

"entéropathie au gluten" seems to be the most common French expression for gluten-sensitive enteropathy.

"entéropathie au gluten"

In order to find synonyms, we can filter the other half of the results with:

"entéropathie * gluten" -"entéropathie au gluten"

OR

"entéropathie * * gluten"

The synonyms are:

"entéropathie par/d’intolérance au gluten"

"entéropathie sensible/intolérante au gluten"

"entéropathie de sensibilité au gluten"

"entéropathie liée au gluten"

"entéropathie induite par le gluten"

"entéropathie dépendante du gluten"

 

9) Using Google as a spell-checker

Google can be used to check the most common spelling of a word that may be hyphenated:

auto-formation

autoformation
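The spellings to compare can be generated mechanically. A minimal sketch: `spelling_variants` and `variant_queries` are hypothetical helpers, and the hit counts still have to be read from the results pages by hand.

```python
def spelling_variants(term):
    """Return the hyphenated, spaced and fused spellings of a term."""
    parts = term.split("-")
    return ["-".join(parts), " ".join(parts), "".join(parts)]


def variant_queries(term, context=""):
    """Build one query per spelling, quoting any variant that contains a space."""
    queries = []
    for variant in spelling_variants(term):
        query = '"{}"'.format(variant) if " " in variant else variant
        queries.append((query + " " + context).strip())
    return queries


for query in variant_queries("auto-formation"):
    print(query)
for query in variant_queries("same-sex marriage", "Burma"):
    print(query)
```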

 

10) The "define" and "glossary" functions

"normocytic anemia" /define

"normocytic anemia" /glossary

 

Practice exercise:

1) Find the French-language web pages published in Canada whose title contains the term "sclérose en plaques".

2) Find the web pages whose title contains both "sclérose en plaques" and "Canada". Note the number of pages found and their URLs.

3) Find the English-language web pages about cancer in Niger, excluding breast cancer.

4) Find the web pages hosted on the Université Lyon 1 site that deal with both Parkinson's disease and Alzheimer's disease.

5) Find all the pages that link to the official site of the President of the United States.

6) Given that, in cardiology, "grande circulation" is rendered as "systemic circulation", find the probable English translation of "petite circulation" on the web.

7) Is the expression "pathologie infectieuse" typical of the medical field? Check how it is used in the plural. How can these differences in usage be explained?

8) Compare the frequency of the expression "au décours de" on the web as a whole and in the subset of the web defined by the 4 types of URL mentioned above.

 

11) Other on-line resources for terminologically oriented research

a) The metasearch dictionary at http://onelook.com

Accepts the * wildcard within character strings, which is convenient when you are unsure of a suffix

Finding the translation equivalent for a French term (anémie normocytaire)

normocyt* anemia

gives normocytic anemia
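This kind of wildcard lookup can be reproduced on a local word list with Python's standard `fnmatch` module. The headword list below is a made-up sample, and the sketch makes no claim about how OneLook itself performs the search.

```python
from fnmatch import fnmatchcase


def wildcard_lookup(pattern, headwords):
    """Return the headwords matching a pattern in which * stands for any characters."""
    pattern = pattern.lower()
    return [word for word in headwords if fnmatchcase(word.lower(), pattern)]


# Hypothetical sample of dictionary headwords.
headwords = [
    "normocytic anemia",
    "normochromic anemia",
    "macrocytic anemia",
    "normocytosis",
]
print(wildcard_lookup("normocyt* anemia", headwords))
```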

 

artère basilaire

basil*

The On-line Medical Dictionary is free (the definitions are very brief, but all the terms are cross-referenced with hypertext links):

basilar artery

Supplies the pons and gives rise to the vertebral arteries, Provides branches to the cerebrum and cerebellum.

 

1 link to a general dictionary (Dictionary.com) which gives several definitions from various dictionaries

basilar artery n.

The union of the two vertebral arteries, running from the lower to the upper border of the pons, with anterior spinal, the two inferior cerebellar, the labyrinthine, pontine, and superior cerebellar branches.

Source: The American Heritage® Stedman's Medical Dictionary
Copyright © 2002, 2001, 1995 by Houghton Mifflin Company. Published by Houghton Mifflin Company.


Main Entry: basilar artery
Function: noun
: an unpaired artery that is formed by the union of the two vertebral arteries, runs forward within the skull just under the pons, divides into the two posterior cerebral arteries, and supplies the pons, cerebellum, posterior part of the cerebrum, and the inner ear

Source: Merriam-Webster Medical Dictionary, © 2002 Merriam-Webster, Inc.


basilar artery

n : an unpaired artery; supplies the pons and cerebellum and the back part of the cerebrum and the inner ear [syn: arteria basilaris]

Source: WordNet ® 2.0, © 2003 Princeton University

 

b) Using "glossary" (or its equivalents) in URLs in order to find definitions

 

splénique inurl:glossaire

www.medinfos.com/principales/glossaire/veinesplenique.shtml

www.medinfos.com/principales/glossaire/atrophiesplenique.shtml

 

For a higher number of hits, glossary can simply be entered as a search word:

splénique glossaire

http://www.chups.jussieu.fr/polys/anapath/Cours/POLY.Glos.html

 

glossary site:geocities.com

n results from geocities.com for glossary.

 

glossary medical site:geocities.com

n results from geocities.com for glossary medical.

 

"mitral stenosis"             

SYNONYMS 1 =

intitle:dictionary OR intitle:glossary OR intitle:lexicon OR intitle:definitions

 

n results for "mitral stenosis" intitle:dictionary OR intitle:glossary OR intitle:lexicon OR intitle:definitions

 

OTHER ABBREVIATIONS FOR “GLOSSARY” COMMONLY FOUND IN URLs

inurl:dict OR inurl:gloss OR inurl:glos OR inurl:dic
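Chains of alternatives like these are tedious to type and easy to generate. A minimal sketch that reuses the word lists given above; the names `or_chain` and `glossary_query` are invented for this example, and the code only builds the query string.

```python
URL_ABBREVIATIONS = ["dict", "gloss", "glos", "dic"]
TITLE_WORDS = ["dictionary", "glossary", "lexicon", "definitions"]


def or_chain(operator, values):
    """Join operator:value pairs with OR, e.g. inurl:dict OR inurl:gloss."""
    return " OR ".join("{}:{}".format(operator, value) for value in values)


def glossary_query(term, operator="inurl", values=URL_ABBREVIATIONS):
    """Combine a term (quoted if it is a phrase) with a chain of alternatives."""
    quoted = '"{}"'.format(term) if " " in term else term
    return "{} {}".format(quoted, or_chain(operator, values))


print(glossary_query("splénique"))
print(glossary_query("mitral stenosis", "intitle", TITLE_WORDS))
```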

 

 

Acronyms

http://www.acronymfinder.com/