1948 was an auspicious year in the development of both scientific information management and the use of computers to search text files. The Royal Society Scientific Information Conference identified the challenges that lay ahead in managing the flow of scientific information, challenges that arguably we have still not solved. The earliest research into how computers might help was undertaken by Philip Bagley (Bagley 1951) as part of a Masters project at MIT. His thesis was entitled Electronic Digital Machines for High-Speed Information Searching. He set out the basic principles of ‘information searching’ and wrote a program for the Whirlwind computer at MIT.
Following graduation, Bagley was employed at MIT Lincoln Laboratory, and then at MITRE Corporation, where he worked on the SAGE air defense system. In 1964 he moved to the Philadelphia area to enter graduate school in Computer and Information Science at the University of Pennsylvania.
He submitted his PhD dissertation in 1969, in which he coined the now widely familiar term ‘metadata’, but the thesis was not accepted. It was published only as a report under contract with the Air Force Office of Scientific Research, entitled Extension of Programming Language Concepts.
By June 1952 there was enough interest in the subject at a number of research centres across the USA to hold a Symposium for Machine Techniques for Information Selection at MIT. One of the speakers at the Symposium was Hans Peter Luhn, at that time working on punched-card retrieval systems for IBM. Luhn would turn out to be hugely influential in information retrieval and his hash algorithm (which he developed in the late 1950s) remains in use to this day.
Another very influential person was Eugene Garfield, who in 1955 published a paper in Science about the value of citation analysis (Garfield 1955). Building on this approach, Garfield launched his Institute for Scientific Information to commercialise citation analysis. His insight also became one of the innovations incorporated into Google at the outset in the 1990s, but that is another story. Of more immediate interest is a paper by Allen Kent and his colleagues at the Battelle Memorial Institute, Ohio. In this paper (Kent et al. 1955) the concepts of ‘recall’ and ‘pertinency’ are proposed as metrics for a search application.
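The two metrics can be illustrated with a minimal Python sketch. It assumes the definitions that later became standard: recall is the share of all relevant documents that the search returned, and pertinency (the measure now usually called precision) is the share of returned documents that are relevant. The function name, document identifiers and relevance judgements are hypothetical and are not taken from Kent's paper.

```python
def recall_and_pertinency(retrieved, relevant):
    """Return (recall, pertinency) for a single query.

    retrieved: ids of the documents returned by the search
    relevant:  ids of the documents judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found

    # Guard against empty sets so that the ratios are always defined.
    recall = len(hits) / len(relevant) if relevant else 0.0
    pertinency = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, pertinency


# Hypothetical query: four documents retrieved, three of them relevant,
# out of six relevant documents in the whole collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d3", "d7", "d8", "d9"}
recall, pertinency = recall_and_pertinency(retrieved, relevant)
print(recall, pertinency)  # 0.5 0.75
```

In this example the search found half of what it should have found, while three quarters of what it returned was useful, which shows why the two figures need to be reported together.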
There were two further important conferences in the 1950s. The first was the International Study Conference on Classification for Information Retrieval, held in Dorking, UK in 1957. This was the first opportunity for UK and US research teams to exchange ideas and research on information retrieval. The USA may have had a technology lead, but the UK was held in high regard for research and implementation of classification and index frameworks.
A year later an International Conference on Scientific Information was held in Washington D.C. to take note of developments since the 1948 Royal Society conference, and much of the discussion was about information retrieval. The papers make for some fascinating reading. By 1958 Dow Chemical was evaluating how computer-based systems could be used to manage in-house documentation.
The chemistry community has some special information retrieval challenges (such as searching chemical structures) and has always been in the vanguard of search development. It was at an American Chemical Society meeting in Miami in 1957 that Luhn gave a paper on A statistical approach to mechanized encoding and searching of literary information (Luhn 1957) in which (in effect) he set out the constituent elements of a search application.
The following year Luhn published a paper on his work at IBM (Luhn 1958) in which (according to the abstract):
“Excerpts of technical papers and magazine articles that serve the purposes of conventional abstracts have been created entirely by automatic means. In the exploratory research described, the complete text of an article in machine-readable form is scanned by an IBM 704 data-processing machine and analyzed in accordance with a standard program. Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the ‘auto-abstract’.”
This was indeed a visionary approach. Luhn also proposed that the frequency of word occurrence in an article furnished a useful measurement of word significance. This is the origin of the now familiar term frequency – inverse document frequency model, although it was not until 1972 that Karen Spärck Jones developed a rigorous statistical basis for TF.IDF.
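The idea in Luhn's abstract can be sketched in a few lines of Python. This is a simplified illustration and not a reconstruction of the IBM 704 program: it scores each word by its frequency in the article, scores each sentence by the mean frequency of its words, and extracts the highest scoring sentences. The stop list, the tokenisation rules and the averaging step are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed stop list of common words that carry little meaning (illustrative only).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
              "for", "on", "by", "was"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def significant_words(text):
    """Lower-case alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def auto_abstract(text, n_sentences=1):
    """Return the n highest scoring sentences, in their original order."""
    sentences = split_sentences(text)
    frequency = Counter(significant_words(text))  # word significance

    def score(sentence):
        tokens = significant_words(sentence)
        # Mean word frequency, so long sentences are not favoured automatically.
        return sum(frequency[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


text = ("Search systems index documents. "
        "Documents are ranked so that search users find relevant documents quickly. "
        "The weather was pleasant.")
print(auto_abstract(text))  # ['Search systems index documents.']
```

The off-topic third sentence scores lowest because none of its words recur elsewhere in the text, which is the behaviour that word-frequency significance is meant to capture.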
In 1959 Maron and Kuhns wrote a seminal paper entitled On relevance, probabilistic indexing and information retrieval (Maron and Kuhns 1960) in which they defined ‘relevance’ (to replace ‘pertinency’) and proposed the use of ‘probabilistic indexing’ to allow a computing machine, given a request for information, to make a statistical inference and derive a number (which they called the ‘relevance number’) for each document. They suggested that this could be a measure of the probability that the document will satisfy the given request. The result of a search would then be an ordered list of those documents which satisfy the request, ranked according to their probable relevance. The achievement of high levels of relevance has since become the Holy Grail of enterprise search.
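A loose Python sketch of ranking by a ‘relevance number’ is given here for illustration. It assumes that each document carries weighted index terms and a prior weight, and that the relevance number is the product of the prior and the weights of the requested terms. The weights, the priors and the product form for multi-term requests are assumptions made for the example and should not be read as the formulation in the paper.

```python
def relevance_number(term_weights, prior, request_terms):
    """Score one document: prior weight times the weight of each requested term.

    A requested term that is missing from the document has weight 0.0,
    so the document drops out of the ranking.
    """
    score = prior
    for term in request_terms:
        score *= term_weights.get(term, 0.0)
    return score


def rank(index, priors, request_terms):
    """Return the ids of matching documents, most probably relevant first."""
    scored = [(relevance_number(index[doc], priors[doc], request_terms), doc)
              for doc in index]
    matching = [(score, doc) for score, doc in scored if score > 0]
    return [doc for score, doc in sorted(matching, key=lambda p: p[0], reverse=True)]


# Hypothetical weighted index: document id -> {index term: weight between 0 and 1}.
index = {
    "doc_a": {"retrieval": 0.9, "indexing": 0.4},
    "doc_b": {"retrieval": 0.5, "indexing": 0.8},
    "doc_c": {"chemistry": 0.7},
}
# Hypothetical prior weight for each document.
priors = {"doc_a": 0.5, "doc_b": 0.3, "doc_c": 0.2}

print(rank(index, priors, ["retrieval"]))  # ['doc_a', 'doc_b']
print(rank(index, priors, ["indexing"]))   # ['doc_b', 'doc_a']
```

A Boolean search for either term would return the same two documents with no ordering. The weighted version places the document most strongly associated with the request first, and the order changes with the request.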
The importance of the paper is that Maron and Kuhns then evaluated their proposal through a manual (rather than computer-based) trial, so setting out not only the fundamental principle of determining the probability that a document was relevant but also the importance of system evaluation. Almost fifty years later Maron published a short account (Maron 2007) of the background to this paper, in which he provides a fascinating insight into how he and Kuhns developed this principle.
The transition from cards to computers is described in detail by both Harman (Harman 2019) and Robertson (Robertson 1994). A number of papers on the early history of the adoption of computers into the production of Chemical Abstracts were given at a conference held in 2014 on the Future of the History of Chemical Information.
Although Maron and Kuhns had shown that a probabilistic approach was superior to a Boolean approach, virtually all of what might be seen as the first generation of commercial search applications used Boolean logic because the challenge of calculating a ‘relevance number’ had yet to be solved. It is of note that Maron was at the RAND Corporation, which had set up System Development Corporation (SDC) as a subsidiary. RAND spun off the group in 1957 as a non-profit organisation that provided expertise for the United States military in the design, integration, and testing of large, complex, computer-controlled systems. SDC became a for-profit corporation in 1969 and began to offer its services to all organisations rather than only to the American military. It played an important role in search development. Another important development in 1959 was the establishment of the Augmentation Research Center at Stanford Research Institute under the direction of Doug Engelbart.
By the end of the 1950s almost all the core elements were in place, including understanding the required modularity of the search process, the benefits of a probabilistic view of document retrieval, the concepts of precision, recall and relevance, and the value of testing and evaluation. What was needed now was computing power to provide an acceptable level of responsiveness when searching large collections of documents.
Bagley, P.R. (1951). Electronic digital machines for high-speed information searching. MIT Press. http://hdl.handle.net/1721.1/12185
Garfield, E. (1955). Citation indexes for science: a new dimension in documentation through association of ideas. Science, 122(3159), 108-11.
Kent, A., Berry, M.M., Luehrs Jr., F.U., & Perry, J.W. (1955). Machine literature searching VIII. Operational criteria for designing information retrieval systems. American Documentation, 6(2), 93-101. https://onlinelibrary.wiley.com/doi/10.1002/asi.5090060209
Luhn, H.P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal. October.
Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal. April. https://ieeexplore.ieee.org/document/5392672
Maron, M.E. & Kuhns, J.L. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the ACM. July. https://dl.acm.org/doi/10.1145/321033.321035
Maron, M.E. (2007). An historical note on the origins of probabilistic indexing. Information Processing and Management, 44, 971-972.
Harman, D. (2019). Information retrieval: the early years. Foundations and Trends in Information Retrieval, 13(5), 425-577. http://dx.doi.org/10.1561/1500000065
Robertson, S.E. (1994). Computer retrieval as seen through the pages of the Journal of Documentation. In B.C. Vickery (Ed.) Fifty years of information progress. (118-146). Aslib.