Gerard Salton

Written for IS580, 3/25/2005.

 

        Gerard Salton was born in Germany in 1927. He immigrated to the United States in 1947, where he attended Brooklyn College and received his Bachelor’s and Master’s degrees in mathematics. He earned his Ph.D. at Harvard, where he was one of the first programmers for Howard Aiken’s Mark IV computer (Computing Research Association 1). He then taught for several years at Harvard before moving to Cornell University in 1965 as one of the founders of the Computer Science department. He was the first recipient of the Association for Computing Machinery Special Interest Group on Information Retrieval (SIGIR) award for outstanding contributions to information retrieval, which is now known as the Gerard Salton Award. He served as editor of the Communications of the ACM, the Journal of the ACM, and the ACM Transactions on Database Systems, in addition to serving on the ACM Council. He was the Chair of SIGIR from 1979 to 1983.

        Salton published widely throughout his career, writing five books on information retrieval and over 150 articles in various journals. He was concerned with “natural-language processing, especially information retrieval” (Cornell 1). His work with SMART, the first system to implement the vector space model in information retrieval, led to the development of concepts such as statistical term weighting and relevance feedback, and it is the theoretical foundation for most information retrieval systems today. Many Internet search engines use the algorithms and concepts developed in the SMART project. His experiments in information retrieval systems “greatly contributed to the knowledge base of computerized information indexing, storage and retrieval” (ASIS).

        Salton also did much to develop and advance the concepts of term frequency and inverse document frequency, often combined as TFIDF, which are widely used in search engine algorithms. Term frequency measures how often a term appears in a single document. Inverse document frequency measures the rarity of a word in the collection by dividing the collection size by the number of documents containing the term, usually taking the logarithm of the result. Words such as “the” and “it” are very common and have a low IDF, and many search engines ignore them when matching queries. The TFIDF weight of a term is its term frequency in the document multiplied by its inverse document frequency; it is used to assign value to a term and to rank search results (Chau 6), and so to determine which documents are most relevant to a query.
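The weighting described above can be illustrated with a minimal sketch. It assumes raw term counts for term frequency and the logarithm of (collection size / document frequency) for inverse document frequency, which is one common formulation and not necessarily the exact weighting used in SMART. The function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (illustrative only)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        # Assumed variant: weight = tf * log(N / df)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical three-document collection.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
weights = tf_idf(docs)
```

In this toy collection, “the” occurs in every document, so its IDF is log(1) = 0 and its weight vanishes no matter how often it appears, while a term found in only one document, such as “mat”, receives the highest weight.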

        Salton’s most famous contribution to information science is his work with the System for the Mechanical Analysis and Retrieval of Text, or SMART, which he developed in the 1960s. His work with SMART “was the foundation of much practical and theoretical work in the field of Information Retrieval” (McGill 1). The SMART experimental project emphasizes automatic information retrieval from large texts and automatic indexing using natural language processing. It uses TFIDF to assign index terms to a text, then classifies the texts into subject categories by shared index terms. It then uses the index terms to retrieve documents based on similarity between the document and the query. In a vector space model, each document in the system is encoded as a vector whose components represent “the importance of a particular term in representing the semantics or meaning of that document” (Berry 337). Once the documents and queries have been converted to vectors, the system can match documents against a query and rank them according to how similar each document vector is to the query vector, commonly measured by the cosine of the angle between them. The document vectors in the database become the columns of a matrix (see graphic 1.1).

Graphic 1.1: Example of a term-by-document matrix produced by the vector space model (Berry 341)
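As a rough illustration of how a term-by-document matrix supports retrieval, the sketch below ranks documents by cosine similarity to a query vector. The terms, the counts in the matrix, and the query are all invented for this example; they are not the values from the Berry figure, and cosine similarity is assumed here as the matching function.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical term-by-document matrix: rows are terms, columns are documents.
terms = ["retrieval", "vector", "music", "hockey"]
matrix = [
    [2, 1, 0],  # retrieval
    [1, 2, 0],  # vector
    [0, 0, 3],  # music
    [0, 1, 1],  # hockey
]

# Each column of the matrix is one document vector.
doc_vectors = [list(col) for col in zip(*matrix)]

# The query "retrieval vector" encoded over the same terms.
query = [1, 1, 0, 0]

scores = [cosine(query, d) for d in doc_vectors]
# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(doc_vectors)), key=lambda j: scores[j], reverse=True)
```

Here the third document shares no terms with the query, so its score is zero and it is ranked last.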

 

One of the problems of the vector space model is that it may leave out a relevant document because the document’s indexed terms do not match the query closely enough. Polysemy (one word with several meanings) and synonymy (several words with one meaning) are further problems for indexing schemes and IR systems, and both can reduce the precision of information retrieval. There are several ways to address these problems, such as term weighting and controlled vocabulary. The SMART system has experimented with linguistic techniques for identifying the content of natural language texts, including synonym classes and hierarchical term arrangement, to solve the relevance and recall problems of the basic vector space model (Salton 318).
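The synonymy problem and the idea behind synonym classes can be shown with a small sketch. The class table, the class label VEHICLE, and the helper names below are hypothetical and chosen only for illustration; they do not reproduce SMART’s actual thesaurus.

```python
# Hypothetical synonym classes: each term maps to a shared class identifier.
SYNONYM_CLASSES = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "auto": "VEHICLE",
}

def normalize(text):
    """Replace each term by its synonym class, if it has one."""
    return {SYNONYM_CLASSES.get(tok, tok) for tok in text.lower().split()}

def overlap(query, document):
    """Count index terms shared by query and document after normalization."""
    return len(normalize(query) & normalize(document))

query = "automobile repair"
document = "car repair manual"

# Plain term matching only finds "repair".
raw_overlap = len(set(query.split()) & set(document.split()))
# With synonym classes, "automobile" and "car" also match.
class_overlap = overlap(query, document)
```

Without the class table the query and document share one term; with it they share two, so a document that uses different wording for the same concept is less likely to be missed.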

        Dr. Salton was deeply dedicated to information retrieval and computer science, staying loyal to the program and his SMART project during the 1970s when interest in IR was low (Cornell 1). He was a prolific writer, devoted a great deal of his time to the ACM, and shepherded many students through the graduate program at Cornell University. He also made time for interests outside of his active professional life. Gerry Salton enjoyed water and winter sports and was a fan of Cornell University’s ice hockey team. He enjoyed music and belonged to the Cornell University Faculty Committee on Music, and he was an avid flower gardener. Dr. Salton died of cancer on August 28, 1995.

 

Citations

 

ASIS. “Salton.” Information Science Pioneers. 14 January 1998. http://www.asis.org/Features/Pioneers/salton.htm

 

Berry, Michael W., Zlatko Drmac and Elizabeth R. Jessup. “Matrices, Vector Spaces, and Information Retrieval.” Society for Industrial and Applied Mathematics Review 41:2 (1999): 335-362.

 

Buckley, Chris, James Allan and Gerard Salton. “Automatic Routing and Ad-hoc Retrieval Using SMART: TREC 2.” Information Processing and Management 31.3 (1995): 315-326.

 

Chau, Michael, Zan Huang and Hsinchun Chen. “Teaching Key Topics in Computer Science and Information Systems through a Web Search Engine Project.” Journal on Educational Resources in Computing 3.3 (September 2003): 1-14.

 

Computing Research Association. “Salton dies; was leader in information retrieval field.” November 1995. http://www.cra.org/CRN/html/9511/people/none.12_1_t.shtml

 

Cornell University. “Gerard Salton: In Memorium.” http://www.cs.virginia.edu/~clv2m/salton.txt

McGill, Mike. “Gerard Salton, In Memorium.” IRLIST Digest, Volume XII, Number 34 Issue 271. September 4, 1995. http://www.dcs.gla.ac.uk/idom/irlist/new/1995/95-xii-34-271/Gerard_Salton,_In_Memorium.html

 

Salton, Gerard. “The SMART environment for retrieval system evaluation – advantage and problem areas.” Information Retrieval Experiment. Ed. Karen Sparck Jones. London, Butterworths, 1981: 316-329.