Grand Computing Challenges for Sociology

William Sims Bainbridge

Social Science Computer Review 12:2, Summer 1994, pp. 183-192.

Abstract

Consideration of four possible grand computing challenges in sociology suggests that progress will come from wholly fresh approaches, rather than from mere improvements in current kinds of social science computing. Although it will be useful to access ordinary libraries over the future universal communications net, placing conventional sociological data sets on it will accomplish little. More valuable would be development of technologies that would allow social scientists far greater access to data already collected by government and industry while protecting the confidentiality of information about individuals. Especially exciting would be automatic systems to read data from hand-written records like the massive historical censuses of the 19th century. In nearly every area of social science, artificial social intelligence has the potential to revolutionize research and theory.

Keywords: computing, sociology, networks, government data sets, U.S. census, artificial intelligence.

Grand Computing Challenges for Sociology

If there are grand challenges for computing in the social sciences, they are not going to involve mere improvements in the kind of machine-assisted data analysis sociologists have been doing for three decades. The real grand challenges are to take quantitative analysis to a whole new level of achievement and - even more grand - to find radically new ways of thinking based on as-yet-unachieved capabilities of future computing technology.

Thus, I am skeptical about the "more-is-better" philosophy that tells social scientists to migrate onto supercomputers and plug into Internet. To be sure, many good things may come from these increases in raw computing power and access to data. But imagination and the search for other directions of achievement are more important. Let us consider four possible grand challenges in light of this principle: social data on the universal net; massive, confidential government data sets; historical censuses of the United States; and artificial social intelligence, or ASI.

Challenge I: Social Data on the Universal Net

Social science would benefit greatly if the entire contents of the Library of Congress were made available over Internet or over the information highway of the future, not the least because it would give every researcher access to material that now only a few can use. For years I have been critical of the famous sociological classic, Suicide, by Emile Durkheim (1897), but my understanding of the defects of that influential book deepened significantly when I found Adolf Wagner's (1864) excellent statistical study in the Library of Congress, the book from which Durkheim took much of his data and which meets a far higher standard of scientific analysis. Every scholar should recognize the potentially great value of employing the net to make interlibrary loans cheap and instantaneous, giving everybody ready access to all publications.

It is an illusion, however, to believe that making most sociological data sets available over the network would be highly valuable. Consider the General Social Survey (GSS) (Davis and Smith, 1991), one of the most widely used sociological data sets. A single 3.5-inch microdisk contains the 1987 GSS in the Microcase data format (Cognitive Development, 1990). In a file of 595,255 bytes, it holds 1,819 cases and 450 variables, along with a codebook. The data are stored in compressed form, but Microcase can analyze them quite effectively without decompression. A seven-variable multiple regression or a four-dimensional crosstab runs at a thousand cases per second on an old 386sx microcomputer, even if you aggregate all 18 years of the GSS to get over 27,000 cases. This is a triumph of modern computing and a thoroughly mature technology that needs no further development.

The disk itself is 10-year-old technology (and so sturdy that a burglar once stepped all over my disks without damaging them). It can be sent by ordinary first-class mail at low cost and high reliability anywhere in the world in a matter of days, and the computer need not be plugged into the net. Perhaps the best medium for dissemination of data sets is CD-ROM. How much of an advantage is it to receive the data instantaneously over the net rather than waiting a few days for them to come by mail? Not much, primarily because it takes a long time, probably measured in weeks rather than days, to get ready to analyze the data.

First, the user needs to become familiar with the GSS. Respondents are a cluster sample of the noninstitutionalized American adult population (which means no military personnel were included), and this limits the inferences one can draw from it. The questionnaire items have long histories, and the user needs to study the scientific meaning of each item, a job facilitated by the GSS's extensive item-by-item index to the 2,500 publications, dissertations, and reports that have been based on it.

Second, the user must make decisions about the sample of respondents. The 1987 GSS included an oversample of African Americans; these 235 extra respondents must be removed, or the races must be analyzed separately, or an elaborate statistical weighting procedure must be employed. For many purposes it can be valuable to aggregate data from two or more years, then perhaps draw subsamples to focus on particular kinds of respondents.

Third, the user must do extensive recoding. With the exception of sex, where each respondent is coded either male or female, all variables require recoding before statistical analysis, both to handle missing data and to put responses in a reasonable order. Several different recoding schemes might be appropriate, depending upon your scientific aims and the distribution of responses to the particular item.

Fourth, a precise plan must be developed for multivariate analysis. This may mean construction of indexes by combining a number of variables, complete with evaluation of the reliability of each index. It may mean choosing among a number of measures of a particular concept. The right control variables need to be selected, and multicolinearity among independent variables must be assessed.

My point is that many such decisions must be made, and elaborate data transformations completed, before analysis can begin. To do this, you need the 1000-page codebook, a pile of publications that have already used the particular items, plus knowledge of the theoretical literature in the field of your research.

Students should not be misled into believing that valid research can be done at the touch of a computer key. The time required to get a data set by mail rather than on the net is inconsequential compared to the time needed to get ready to analyze it. Investment in expanding the net and putting GSS on it will accomplish little. The real bottleneck for the social scientist is the cost of collecting new data - over a million dollars a year for the General Social Survey.

Challenge II: Massive, Confidential Government Data Sets

It is a mistake to believe that big data sets demand big computers. Suppose you want to do a cross tabulation of the following census variables to control for age and education in a comparison of incomes between the sexes: age in years (values: 0-120), sex (values: 1-2), years of schooling (values: 0-25), and income (values: 1-15 income ranges). How much RAM data space do you need for 250 million cases? The answer is only 377,520 bytes (or 369K). That is the size of a four-dimensional array, with one cell for every possible combination of values on the four variables, employing longint (long integer) variables, which take 4 bytes per cell. You do not need the entire set of raw data in RAM, but need merely to add the number 1 to the appropriate cell simultaneously with reading each case from CD-ROM or other massive storage system. The amount of RAM required is independent of the number of cases but is determined by the complexity of the analysis you want to do.
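
A minimal sketch in Python makes this arithmetic concrete. The record layout (age, sex, schooling, income code) and the use of the numpy library are my own assumptions; the array dimensions follow the example above.

# Sketch of the tally-array strategy described above.  Assumed record
# layout: each case is a tuple of age 0-120, sex 1-2, schooling 0-25,
# income code 1-15.
import numpy as np

def tabulate(cases):
    # One cell per combination of values: 121 x 2 x 26 x 15 = 94,380 cells.
    # At 4 bytes per cell (32-bit integers) this is 377,520 bytes of RAM,
    # no matter how many cases stream past.
    table = np.zeros((121, 2, 26, 15), dtype=np.int32)
    for age, sex, schooling, income in cases:
        table[age, sex - 1, schooling, income - 1] += 1
    return table

# A handful of fabricated cases; a real run would read 250 million
# records sequentially from CD-ROM or other mass storage.
example = [(34, 1, 12, 5), (34, 2, 12, 5), (67, 2, 16, 9)]
print(tabulate(example).sum())   # 3 cases tallied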

Thus, current desktop microcomputers are a mature technology, capable of handling the most massive data sets for the kinds of statistical analysis sociologists do, and supercomputers are irrelevant. The technical challenge is to develop storage media that can hold the data sets and move them in and out of the computer fast. But very high speeds are not necessary. The logic of science is batch processing, not interactive computing. You have a theory, and you derive testable hypotheses from it. You find or collect suitable data including measures that operationalize the concepts in the hypotheses. Ideally, you know exactly what tables you need and are content to wait a few hours to see the results.

The problem is that massive databases are too expensive for individual social scientists to collect, and this is one reason the National Science Foundation supports such continuing multiuser projects as the General Social Survey and the Panel Study of Income Dynamics (Hill, 1992). The really big existing data sets belong to government (e.g., the census) or industry (e.g., records of long-distance telephone calls). And the chief obstacle to social scientific analysis is not technical or economic but ethical.

In the 19th century, census takers posted copies of their completed enumeration sheets in the town square, so that everybody could check their accuracy. Since a presidential proclamation issued in 1910, however, individual census data have been kept under strict confidentiality. The names of the people in the household are not entered from the paper questionnaires into a computer. After microfilming, the questionnaires are destroyed, and the microfilms are not made public until the passage of 72 years (Bureau of the Census, 1992). Relatively small public-use samples may be made available, but only in a form that preserves the anonymity of respondents.

Measures taken to insure anonymity and confidentiality severely limit legitimate research possibilities. For example, the published versions of Canadian censuses often round off numbers at random, up or down, to a number ending in 5 or 0. I have found this troublesome when working with rates whose numerators are small numbers, because the "noise" in the data is extremely loud. In the case of U.S. censuses, the lack of names prevents linking individual records across years - for example, to examine migration patterns.

At present, a bona fide social science researcher can go at great expense to the Census Bureau in Suitland, Maryland, be sworn in as an employee, and under the close scrutiny of regular employees gain somewhat increased access to confidential computerized data. The researcher faces severe legal penalties for violating confidentiality, and data cannot be removed. A satellite research site is currently being developed in Boston, in a joint effort of the Census Bureau and the Economics Program of the National Science Foundation, to give university researchers greater access to government data on corporations. If it is successful, other such satellite laboratories may be created and provided with a wider range of data.

To a significant extent, data confidentiality can be treated as a technical problem, and a grand challenge for the social sciences is to develop technical means for insuring it while making raw government data widely available to researchers (Keller-McNulty & Unger, 1993). It is not difficult to write statistical software that prevents users from inspecting individual data records. The user could not see variables that directly identify individuals, and could not employ crosstabs, subsamples, or other means to indirectly identify them. Of course, the system would also employ all the conventional means for limiting access to only those persons who have been given authorization, on a variable-by-variable basis.
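
To illustrate the general idea, here is a hypothetical sketch of a tabulation routine that refuses direct identifiers and suppresses sparsely populated cells. The variable names, the blocked-variable list, and the threshold of five are illustrative assumptions, not actual Census Bureau rules.

# Hypothetical sketch of a statistical server that returns only aggregate
# tables and suppresses cells containing few respondents, one common device
# for preventing indirect identification.  The threshold and the list of
# blocked variables are illustrative.
from collections import Counter

SUPPRESSION_THRESHOLD = 5
BLOCKED_VARIABLES = {"name", "street_address", "social_security_number"}

def safe_crosstab(records, row_var, col_var):
    if row_var in BLOCKED_VARIABLES or col_var in BLOCKED_VARIABLES:
        raise PermissionError("direct identifiers may not be tabulated")
    counts = Counter((r[row_var], r[col_var]) for r in records)
    # Replace small counts with None so that no published cell can
    # isolate an individual respondent.
    return {cell: (n if n >= SUPPRESSION_THRESHOLD else None)
            for cell, n in counts.items()}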

Appropriate data encryption can add flexibility. Names could be entered into the computer as codes that cannot be read by humans but in a form such that the computer can still link records. Public-key cryptography techniques (Merkhofer, 1981) can preserve anonymity while allowing the researcher to add fresh data - for example, attaching county-level statistics to individual census records without being able afterward to identify the county an individual lives in. These techniques not only must actually preserve confidentiality, but they must do so in a manner convincing to the public. The data exist, and the government has access to them. Giving appropriately controlled access to social scientists will serve the public interest, not the least for insuring greater accountability of the government itself.
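
The sketch below uses a keyed one-way hash as a simple stand-in for the public-key methods cited above; it shows only how names might be replaced by codes that humans cannot read back yet still match across files. The key and field choices are hypothetical.

# Illustrative sketch only: a keyed one-way hash (HMAC) stands in for the
# public-key techniques cited in the text.  Names become codes that humans
# cannot read back, yet identical names hash to the same code, so the
# computer can still link records across files.
import hmac, hashlib

SECRET_KEY = b"held-by-the-census-bureau-only"   # hypothetical key

def pseudonym(name, birthplace):
    token = (name.strip().upper() + "|" + birthplace.strip().upper()).encode()
    return hmac.new(SECRET_KEY, token, hashlib.sha256).hexdigest()

# The same person yields the same code in two different census files,
# allowing linkage without revealing the name to the researcher.
print(pseudonym("John Kane", "Kentucky") == pseudonym("John Kane", "Kentucky"))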

Challenge III: Historical Censuses of the United States

The National Archives and other data repositories contain vast files of historical data that can be used to test general theories of social behavior as well as to illuminate episodes in the nation's past. For example, the original schedules of all censuses from 1790 through 1920 (except for most of 1890, which was destroyed by fire) are currently available on microfilm. These are extremely difficult to use, but if they were in machine-readable form, they would revolutionize many fields of scholarship (Burton & Finnegan, 1991).

The best start would be a major program to enter the entire 1860 U.S. census, in conjunction with development of new software techniques to automate the process completely. About 31.5 million people were counted, and considerable information about organizations such as churches and businesses was also collected; the handwritten data fill 1 million large-format pages. With the exception of slaves, who unfortunately were listed only by number under the names of their owners, all residents were listed by full name and birthplace. Surprising quantities of other data were taken, such as the denomination of all clergy and the supposed cause of insanity for the mentally ill (Bainbridge, 1984b, 1992). The chief challenge would be developing machine techniques for reading the handwriting, and these exist only in partial form at present.

Several features of the census manuscripts and related data sets can help computers learn to read them accurately. Because individual census takers recorded data for whole enumeration districts, there is a substantial sample of each "hand" that will permit multiple comparisons. Especially during the early decades of the 19th century, many people added an extra loop to the numbers 2 and 3, so that 2 looked like a modern 3. But once the computer has located an undeniable 3 in a particular hand, it will be able to recognize a 2 correctly. Throughout the century, many writers added a loop to the right-hand foot of K and similar letters, so that the name Kane would look like Keane. To resolve this, the computer could scan through the list of birthplaces until it found Kansas or Kentucky. More systematically, the computer could begin reading a section written in a particular hand by compiling a list of letters and letter combinations from birthplaces and occupations, and it could iteratively perfect its reading of names by going back and forth among them.
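
One small step in such an iterative strategy might look like the following sketch, which snaps an uncertain reading of a birthplace to the nearest entry in a closed lexicon. The string-similarity routine is merely a stand-in for a genuine handwriting-recognition confidence measure, and the lexicon shown is abbreviated.

# Hedged sketch of one step in the strategy described above: an uncertain
# machine reading of a birthplace is snapped to the closest entry in a
# closed lexicon of known place names (difflib is only a rough stand-in
# for a real handwriting-recognition confidence model).
import difflib

KNOWN_BIRTHPLACES = ["Kansas", "Kentucky", "Ohio", "Virginia", "Ireland"]

def correct_birthplace(raw_reading):
    match = difflib.get_close_matches(raw_reading, KNOWN_BIRTHPLACES, n=1, cutoff=0.6)
    return match[0] if match else raw_reading

print(correct_birthplace("Keantucky"))   # -> "Kentucky", resolving the looped K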

Duplicate handwritten copies of many of the old records still exist, permitting a partial double check. The official instructions to census takers for the 1850 and 1860 censuses (Department of the Interior, 1860) required them to make two handwritten copies of the original census form they carried door-to-door. One copy went to the state and the other to the federal government, and in many cases both copies still exist.

Often, someone has already attempted to read the names and has produced an index that may not be perfectly reliable but can assist a fresh reading done by the computer. Decades ago, the federal government produced "soundex" indexes (Madron, 1985), completely covering the 1900 and 1920 censuses and partially covering the 1880 and 1910 censuses, which reduced last names phonetically to codes consisting of a letter followed by three digits. Originally in the form of 3" x 5" cards on which the names were freshly written in a different hand, these are readily available on 16mm microfilm. Printed indexes have been published for a number of states and years, all ready to be scanned optically. The Church of Jesus Christ of Latter-Day Saints (Mormon) has created splendid genealogical databases that could be a great help. The census pages could be stored as images as well as in ASCII, so scholars would be able to compare the transcription against the original.
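
For concreteness, a simplified version of the soundex coding itself takes only a few lines; the sketch follows the familiar letter-to-digit scheme and omits some of the rarer rules in the official coding instructions.

# A simplified implementation of the soundex coding described above:
# the surname is reduced to its first letter plus three digits, so that
# names which sound alike (Kane, Keane) receive the same code.
def soundex(name):
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    first, rest = name[0], name[1:]
    digits = []
    prev = codes.get(first, "")
    for ch in rest:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "HW":          # H and W do not break a run of like codes
            prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Kane"), soundex("Keane"))   # both K500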

The ultimate aim would be to computerize absolutely all handwritten historical records, with the 1860 census merely the demonstration project that would develop the techniques. Among the most exciting associated projects would be linking 1860 census records to the enlistment records of the Confederate and Union armies for the Civil War, which began just a year after this census was taken. This majestic data set not only would provide much deeper understanding of this watershed in American history but would make possible many studies to test theories. Who first volunteered for the rebel army? Was it the sons of slaveholders defending their wealth? Or was it poor whites who sought upward social mobility? Or was it the neighbors of the men who volunteered the day before, drawn in by their social network? Or was it immigrants joining up in order to establish themselves as full members of the community?

Automatic linkage of separate massive data sets, not only after they had been machine read from handwritten records but even as part of the reading and verification process, has such great potential it staggers the sociological imagination. Some years ago tremendous human effort went into creation of the National Panel Study (NPS), a sample of 4,041 white males linked across the 1880 and 1900 censuses (Landale & Guest, 1990). Had it been possible to link the entire nation at once, there would have been much greater confidence that the correct records were linked - that this person in 1900 is the same as that one in 1880. With data like the NPS, competing scientific theories can be tested in such areas as social mobility, demographic processes, and community development (cf. Bainbridge, 1982, 1984a).
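
A toy sketch suggests the flavor of such automatic linkage. The field names and the matching rule (same normalized surname and birthplace, with age advanced by roughly twenty years between censuses) are illustrative, not the actual procedure used to build the NPS.

# Hedged sketch of automatic linkage across two census years.  The record
# fields, the blocking key, and the age tolerance are illustrative only.
def link_1880_to_1900(records_1880, records_1900, tolerance=2):
    index = {}
    for r in records_1900:
        key = (r["surname"].strip().upper(), r["birthplace"].strip().upper())
        index.setdefault(key, []).append(r)
    links = []
    for r in records_1880:
        key = (r["surname"].strip().upper(), r["birthplace"].strip().upper())
        for cand in index.get(key, []):
            # Between 1880 and 1900 a surviving person should be about
            # twenty years older; allow a small margin for reporting error.
            if abs(cand["age"] - (r["age"] + 20)) <= tolerance:
                links.append((r["id"], cand["id"]))
    return links

pairs = link_1880_to_1900(
    [{"id": 1, "surname": "Kane", "birthplace": "Kentucky", "age": 25}],
    [{"id": 9, "surname": "Kane", "birthplace": "Kentucky", "age": 44}])
print(pairs)   # [(1, 9)]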

Challenge IV: Artificial Social Intelligence

The Sociology Program of the National Science Foundation held a workshop in conjunction with CSS93 to examine the potential of artificial social intelligence (Bainbridge et al., 1994). Edward Brent's article in this issue includes a number of references that would be of use to sociologists who want to become acquainted with ASI; here I shall offer a preamble.

Broadly defined, ASI is the application of machine intelligence techniques to social phenomena. ASI includes computer simulations of social systems in which individuals are modeled as intelligent actors, and it also includes methods of analyzing social data that employ any of the techniques commonly called "artificial intelligence" by computer scientists.

Some ASI programs are simulations, but most existing simulations are not ASI - for example, those that simply plug values into a multiple regression equation that models a particular aspect of a social system. Thus, it might be best to employ a new term for smart simulations to distinguish programs that are based on ASI, and nominations are now open for what that word should be. In a sense, ASI programs are not simulations of social intelligence, but real social intelligence that happens to be rooted in machines rather than in biological organisms. They make decisions, take actions, perceive the results of their actions, learn to adjust their behavior to environmental contingencies, and respond to the actions of others.

Human intelligence is not a purely individual quality, but is based in culture and social interaction. Thus, it may have been a mistake for computer scientists to attempt to develop artificial intelligence without benefit of artificial societies, and ASI is an important contribution social scientists can make to computer science.

Among the current and possible future applications of ASI are the following, roughly arranged in three categories: theory, data collection, and data analysis.

Simulation of human social intelligence can be a valuable tool for advancing theory. For decades, computer simulations have been used throughout the natural sciences and some social sciences to explore the implications of a set of ideas and to model real-world processes. Artificial intelligence techniques can be added to many kinds of simulation, but the most exciting application is the modeling of human intelligence and communication in simulations of social interaction. A typical program of this kind employs a neural network or other computer representation for each intelligent individual, and it embeds a number of them in a system of interaction. This approach can be used to explore the rigor and implications of a theory. A suitable program is written that incorporates some axioms of the theory, and as it runs it will generate results that represent theorems effectively derived from the axioms by the simulation.
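
A minimal sketch along these lines might give each actor a tiny reinforcement learner, standing in for a full neural network, and let randomly chosen pairs of actors exchange repeatedly. The payoff values and learning rate below are arbitrary illustrations, not a calibrated theory.

# Minimal sketch of a smart simulation: each actor is a tiny reinforcement
# learner (a stand-in for a full neural network) whose propensity to
# cooperate is adjusted by the payoff of each exchange.  Payoffs and the
# learning rate are arbitrary illustrations.
import random

class Actor:
    def __init__(self):
        self.p_cooperate = 0.5                  # initial propensity
    def act(self):
        return random.random() < self.p_cooperate
    def learn(self, cooperated, payoff, rate=0.05):
        # Reinforce (or punish) whichever behavior the actor just chose.
        if cooperated:
            self.p_cooperate += rate * payoff * (1 - self.p_cooperate)
        else:
            self.p_cooperate -= rate * payoff * self.p_cooperate
        self.p_cooperate = min(max(self.p_cooperate, 0.01), 0.99)

def run(n_actors=20, rounds=2000):
    actors = [Actor() for _ in range(n_actors)]
    for _ in range(rounds):
        a, b = random.sample(actors, 2)
        ca, cb = a.act(), b.act()
        # Mutual cooperation pays well; cooperating with a defector is costly.
        a.learn(ca, 1.0 if (ca and cb) else (-0.5 if ca else 0.2))
        b.learn(cb, 1.0 if (ca and cb) else (-0.5 if cb else 0.2))
    return sum(x.p_cooperate for x in actors) / n_actors

print(run())   # average propensity to cooperate after repeated interaction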

ASI can also contribute through logical system models of social theories. For a decade or more, computers have been employed to perform logical analysis, and they have become an important tool for mathematicians. For example, the famous four-color problem in topology apparently could not be solved by the unaided human mind, but it has been solved in a computer-assisted proof. Similarly, computer programs can assist in testing the logical adequacy of an existing theory or in developing new theories. Logical engines and expert systems are among the most promising approaches.

Among the roles for ASI in data collection are several in experiments with human subjects, where computers have long been employed. For example, in social psychology experiments on bargaining, human subjects may communicate with each other over a computer system that administers the experiment and records their actions. Today, some experiments use computer techniques to simulate the exchange partners, perhaps employing only a single research subject at a time, who is under the false impression that other humans are sending the messages received over the computer. Of course, the material can all be prerecorded, in which case the computer system is not intelligent. But ASI allows the computer to react to the research subject in complex ways, thus convincingly simulating other research subjects and operationalizing sophisticated social theories. Potential for ASI experimentation exists not only in social psychology but also in economics and political science, even in areas that have not previously employed the experimental method.
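
The hypothetical sketch below shows the simplest possible reactive partner for a bargaining experiment: it concedes in response to the subject's offers rather than replaying a script. A real application would derive the concession rule from the theory being tested.

# Hypothetical sketch of a simulated bargaining partner that reacts to the
# human subject's offers instead of replaying a fixed script.  The initial
# demand, the floor, and the halfway concession rule are illustrative.
class SimulatedPartner:
    def __init__(self, demand=80, floor=40):
        self.demand = demand       # share of 100 points initially demanded
        self.floor = floor         # lowest share it will ever accept
    def respond(self, subject_offer):
        # Accept any offer meeting the current demand; otherwise concede
        # halfway toward the offer, but never below the floor.
        if subject_offer >= self.demand:
            return "accept"
        self.demand = max(self.floor, (self.demand + subject_offer) // 2)
        return f"counteroffer: give me {self.demand}"

partner = SimulatedPartner()
for offer in (30, 50, 65):
    print(partner.respond(offer))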

ASI will also be an essential component of computerized, open-ended, free-form interviewing (COFI). Computer-assisted telephone interviewing (CATI) has proven that computers can facilitate ordinary interviewing conducted by human beings, and a number of surveys have now been done in which the respondent receives the questions and gives answers directly on a computer. In the future, ASI software will permit subtle, open-ended, free-form interviewing, either assisting a human interviewer or operating without one. Aside from gains in efficiency and reduced cost, it will be possible to achieve greater rigor - for example, by preventing the interviewer from leading the respondent and by keeping a complete and explicit record of all interviewer decisions.

Another important application will be natural data flow monitoring. Perhaps the greatest limitation faced by the social sciences is the near impossibility of collecting substantial data sets in natural settings, and survey research is just a poor substitute for observing what people actually do. Computers are beginning to play significant roles in monitoring economic flows (in banking), communication flows (the phone company), and traffic flows (urban signal-light control systems). Work is in progress to computerize satellite-imaging analysis - for example, at NASA's Jet Propulsion Laboratory - and this requires design of new kinds of intelligent systems. To be sure, there are substantial ethical and legal problems, as well as economic and technical ones, in the monitoring of human beings. But intelligent systems may become very useful tools of data collection in public areas and when informed consent can properly be gained.

Once data have been collected, ASI can enhance techniques of analysis, for example in smart scaling. Many traditional techniques of data reduction come close to being ASI because they can be conceptualized as perception systems. A particularly good example is multidimensional scaling (MDS), because the standard algorithms employ learning and decision making, clear hallmarks of intelligence. Potentially, any data-reduction technique can be enhanced with artificial intelligence, and some of the same algorithms are used in advanced scaling programs as in artificial intelligence research - for example, "simulated annealing" (McLaughlin, 1989). If social scientists, statisticians, and programmers can become aware of the affinities between traditional data-reduction techniques and artificial intelligence, they can be more creative in developing new methods that will serve the need for validly reducing uninformative complexity in large data sets.
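
As a concrete illustration, simulated annealing can be applied directly to the scaling problem: perturb point coordinates at random and accept changes that lower the stress, or occasionally ones that raise it, with a probability that shrinks as the system "cools." The parameters in the sketch below are illustrative.

# Hedged sketch of simulated annealing applied to multidimensional scaling:
# two-dimensional coordinates are perturbed at random, and moves are
# accepted by the usual Metropolis rule.  Step counts, step size, and the
# cooling schedule are illustrative choices.
import math, random

def stress(points, dissim):
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            total += (d - dissim[i][j]) ** 2
    return total

def anneal_mds(dissim, steps=20000, temp=1.0, cooling=0.9995):
    n = len(dissim)
    points = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]
    current = stress(points, dissim)
    for _ in range(steps):
        i, axis = random.randrange(n), random.randrange(2)
        old = points[i][axis]
        points[i][axis] += random.gauss(0, 0.1)
        new = stress(points, dissim)
        if new < current or random.random() < math.exp((current - new) / temp):
            current = new
        else:
            points[i][axis] = old
        temp *= cooling
    return points, current

# Three objects whose pairwise dissimilarities form a simple triangle.
dissim = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
coords, final_stress = anneal_mds(dissim)
print(round(final_stress, 3))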

ASI will also help in analysis of content and meaning. We cannot make good use of all that information flooding in over Internet unless we can at least partially automate understanding of it. Already, advanced methods of content analysis are available, and the possibilities for machine perception of meaning are limited only by the human imagination.

Conclusion

In many respects, the key to all social scientific grand challenges will be artificial social intelligence. The techniques to preserve confidentiality of data while analyzing them effectively require the computer to be smart. Reading handwritten records requires substantial intelligence. When everyone has access to everything on the universal communication net, the user will need to rely upon an intelligent computer "agent" to help sift through the data overload and find just the information that is interesting. But there is a feedback loop here. Our computers must become smarter, if they are to serve us well, whether or not they also become much larger and more interconnected. And we must become smarter if we are to design them and use them to achieve the social science of the future.

Note

William Sims Bainbridge is the Sociology Program director at the National Science Foundation. The views expressed in this article do not necessarily represent the views of the National Science Foundation or the United States.

References

Bainbridge, W. S. 1982. Shaker demographics 1840-1900: An example of the use of census enumeration schedules. Journal for the Scientific Study of Religion 21:352-65.

---. 1984a. The decline of the Shakers: Evidence from the United States census. Communal Societies 4:19-34.

---. 1984b. Religious insanity in America: The official nineteenth-century theory. Sociological Analysis 45:223-39.

---. 1992. Social research methods and statistics: A computer-assisted introduction. Belmont, CA: Wadsworth.

Bainbridge, W. S., Brent, E., Carley, K. M., Heise, D., Macy, M., Markovsky, B., & Skvoretz, J. 1994. Artificial social intelligence. Annual Review of Sociology 20.

Bureau of the Census. 1992. 1990 census of population and housing: Guide, Part A. Washington: Department of Commerce.

Burton, V., & Finnegan, T. 1991. Historians, supercomputing, and the U.S. manuscript census. Social Science Computer Review 9:1-12.

Cognitive Development Inc. (currently Microcase Corporation). 1990. Microcase analysis system. Seattle: Microcase.

Davis, J., & Smith, T. W. 1991. General social surveys, 1972-1991: Cumulative codebook. Chicago: National Opinion Research Center.

Department of the Interior, Census Office. 1860. Instructions to U.S. marshals. Washington: Bowman.

Durkheim, E. 1897. Suicide: A study in sociology. Trans. J. A. Spaulding & G. Simpson, 1951. New York: Free Press.

Hill, M. S. 1992. The panel study of income dynamics. Newbury Park, CA: Sage.

Keller-McNulty, S., & Unger, E. A. 1993. Database systems: Inferential security. Journal of Official Statistics 9:475-99.

Landale, N. S., & Guest, A. M. 1990. Generation, ethnicity, and occupational opportunity in late 19th century America. American Sociological Review 55:280-96.

Madron, T. W. 1985. Searching with soundex. PC Tech Journal, April, 163-68.

McLaughlin, M. P. 1989. Simulated annealing. Dr. Dobb's Journal 14 (9): 26-37.

Merkhofer, M. W. 1981. A technology assessment of public-key cryptography: Final report prepared for the National Science Foundation. Menlo Park, CA: SRI International.

Wagner, A. H. G. 1864. Die Gesetzmaessigkeit in den Scheinbar Willkuerlichen Menschlichen Handlungen vom Standpunkte der Statistik. Hamburg: Boyes und Geisler.