International Network for Integrated Social Science

William Sims Bainbridge, National Science Foundation*

Social Science Computer Review, Vol. 17 No. 4, Winter 1999, pp. 405-420.

Abstract

Computer-related developments across the social sciences are converging on an entirely new type of infrastructure that integrates across methodologies, disciplines, and nations. This article examines the potential outlined by a number of conference reports, special grant competitions, and recent research awards supported by the National Science Foundation. Together, these sources describe an Internet-based network of collaboratories combining survey, experimental, and geographic methodologies to serve research and education in all of the social sciences.

Keywords: data archives, geographic information systems, web-based surveys, web-based laboratories, collaboratories, computational social science

The evolution of the social sciences has reached a crucial point that might be called a phase change in which old, atomistic, and impressionistic ways of doing research are superseded by a far more systematic and unified methodology. To bring social science to the level of rigor already achieved by some of the physical sciences, a new type of facility will be needed. This will be a transdisciplinary, Internet-based collaboratory that will provide social and behavioral scientists with the databases, software and hardware tools, and other resources to conduct worldwide research that integrates experimental, survey, geographic, and economic methodologies on a much larger scale than was possible previously. This facility will enable advanced research and professional education in economics, sociology, psychology, political science, social geography, and related fields.

In many branches of social science, a new emphasis on the rigor of formal laboratory experimentation has driven researchers to develop procedures and software to conduct online interaction experiments using computer terminals attached to local area networks (LANs). The opportunity to open these laboratories to the Internet will reduce the cost per research participant and increase greatly the number of institutions, researchers, students, and research participants who can take part. The scale of social science experimentation can increase by an order of magnitude or more, examining a much wider range of phenomena and ensuring great confidence in results through multiple replication of crucial studies.

Technology for administering questionnaires to very large numbers of respondents over the Internet will revolutionize survey research. Data from past questionnaire surveys can be the springboard for new surveys with vastly larger numbers of respondents at lower cost than by traditional methods. Integrated research studies can combine modules using both questionnaire and experimental methods. Results can be linked via geographic analysis to other sources of data including census information, economic statistics, and data from other experiments and surveys. Longitudinal studies will construct time-series comparisons across data sets to chart social and economic trends. Each new study will be designed so that the data automatically and instantly become part of the archives, and scientific publications will be linked to the data sets on which they are based so that the network becomes a universal knowledge system.

The expertise, hardware, databases, and user communities that will make this facility possible are located at many different universities, archives, and government agencies. The Internet, however, can bind them together into a unity. Various sites in the collaboratory will have special responsibilities for collecting, archiving, analyzing, cataloging, retrieving, and integrating social, behavioral, and economic data. All must be linked through shared protocols on data transfer and cost recovery so that the resources at any one location are available to students and researchers at all other locations. New tools for analysis and visualization of data will be required, as will a range of new data collection methods. Like a great telescope, this facility will provide the physical infrastructure and organizational framework for research by scientists from many universities. Its distributed archive will provide data and statistical analysis tools to researchers, students, and policymakers everywhere.

To determine the scientific needs and opportunities in this general area, the National Science Foundation (NSF) Directorate for Social, Behavioral, and Economic Sciences (SBE) has supported many varied workshops, bringing together social scientists and experts on information technologies to advise the NSF as it plans its investments. As early as June 1995, SBE held a workshop at the San Diego Supercomputer Center called "Connecting and Collaborating" (see Appendix), which focused on the implications of the Internet for international scientific collaboration. The workshop report observed, "As computer-mediated communication makes physical separation less and less of a barrier to international collaboration, a 'globalization' of formerly parochial disciplines occurs." This means not only that sociologists from different nations can collaborate but also that the boundaries separating sociology from other disciplines such as political science, economics, and anthropology have begun to dissolve. The workshop participants recognized that the new technology not only permits but practically demands the integration of the social sciences across both nations and disciplines. The greatest minds always have recognized that there should exist a single unified social science, united in the diversity of theories and methodologies it employs. In the 21st century, this dream can be brought to reality (Homans, 1967).

Web-Based Data Archives

Large social science databases are unusual in that they generally are capable of serving many researchers beyond those who originally collected the data. This is to say that "secondary analysis" of existing data sets is a major method of gaining scientific knowledge in economics, sociology, political science, anthropology, and other social and behavioral sciences. Therefore, SBE has long placed a high priority on preserving such data and making them widely available to scientists. The NSF archiving policy for quantitative social and economic data stated, "For appropriate data sets, researchers should be prepared to place their data in fully cleaned and documented form in a data archive or library within [1] year after the expiration of an award."[1]

The largest number of NSF-supported social science data sets have been archived at the Inter-University Consortium for Political and Social Research (ICPSR), located within the Institute for Social Research at the University of Michigan. As its web site explains, the ICPSR is a

membership-based, not-for-profit organization serving member colleges and universities in the United States and abroad. [The] ICPSR provides access to the world's largest archive of computerized social science data, training facilities for the study of quantitative social analysis techniques, and resources for social scientists using advanced computer technologies.[2]

During recent years, the ICPSR has provided data over the Internet by file transfer protocol (FTP) free to member institutions and for a moderate fee to others. Much effort has gone into producing electronic versions of the codebooks that describe all the variables in the most important data sets, but the process of obtaining data remains cumbersome and time-consuming. It is not possible, for example, to browse online across multiple data sets looking for particular data, nor is it possible to combine information from several data sets without first obtaining copies of all of them and writing elaborate software routines from scratch.

The ICPSR is by no means the only social science archive, and several others have received NSF support. For example, the Roper Center was established at the University of Connecticut with the help of a grant from the NSF and has been the repository for the long-running NSF-supported General Social Survey (GSS), which is the foundation for the International Social Survey Program.[3] Over the years, SBE has supported the creation of public use samples from many of the decennial U.S. censuses, and the Social History Research Laboratory at the University of Minnesota has created a system for integrating them and distributing the data freely to anyone over the Internet.[4] The Henry A. Murray Research Center of Radcliffe College received NSF support to create a collection of racially and ethnically diverse behavioral science data sets.[5] The NSF contributed to the National Longitudinal Study of Adolescent Health, which was chiefly supported by the Demographic and Behavioral Sciences branch of the National Institute of Child Health and Human Development. A public use extract of the data is available from the Sociometrics Corporation, which serves as a commercial archive for several data sets.[6]

Recently, several researchers have simply placed their data on the Internet from their universities' own servers. For example, the data from the NSF-supported Russia Longitudinal Monitoring Survey are available directly from the University of North Carolina.[7] The net result is that data are available from so many different sites, in so many different formats, and under so many different financial arrangements that it requires great tenacity to locate and obtain data for secondary analysis. Specialists might have both the knowledge and motivation to get the chief data sets in their own narrow area, but the current situation inhibits interdisciplinary research and severely discourages use of the data by students, policymakers, journalists, and the general public.

Many users of social scientific data and the archive managers themselves have long recognized the high desirability of developing a web-based system for providing social science data swiftly and flexibly to a wide range of users. On June 18, 1996, the prototype General Social Survey Data and Information Retrieval System was launched on the World Wide Web by the ICPSR, the National Opinion Research Center of the University of Chicago, and the Computer-Assisted Survey Methods Program at the University of California, Berkeley, supported by SBE in collaboration with the NSF's Directorate for Computer and Information Science and Engineering and its Directorate for Education and Human Resources.[8] Anywhere in the world, students and researchers can, without cost, access and analyze data from more than 35,000 questionnaire respondents at this site. Well-organized codebooks list the 3,000 questions that have been included in the GSS since its inception in 1972, and a hyperlink goes from each item to tables of data and to abstracts of any of the 3,000 GSS-related publications that used the item. Statistical analysis adequate for most purposes can be done online, and researchers who need to use their own software for more advanced analyses can download the data. This site receives very heavy use and demonstrates both the scientific and educational values of such a service. However, the GSS is only one of many data sets that need to be provided, and the Web-based system still is very primitive in many respects.

The next step in the development of a comprehensive system for providing social survey data is to build a prototype with multiple data sets, multiple sites, and sophisticated search technology. The NSF-led, multi-agency Digital Library Initiative has taken the first step through a $1.8 million grant to Sidney Verba and colleagues at Harvard University to create an operational social science digital data library. This "Virtual Data Center" (VDS) will develop and demonstrate robust scalable methods for linking multiple distributed collections of social science data including creation of system protocols and modular software that will be made freely available. The VDS also will be an applications testing ground, evaluating and improving the capability of digital library technology to serve a variety of users simultaneously. These users include "fact seekers" who need particular pieces of information that might be strewn across several unfamiliar data sets, teachers and students involved in social science education at all levels, and advanced researchers.

The VDS is an outgrowth of the system that Harvard already has created for its own campus, and the distributed survey archive of the future might be located on hundreds of campuses around the world rather than concentrated at the ICPSR or another central archive. An institution like the ICPSR still might be necessary to coordinate efforts, catalog data sets, and lead efforts in survey research education and methodological development. If researchers use modern software and properly document their data sets while they are creating them, then placing the data on the web will be a trivial task. Long-term preservation of data might require mirror site arrangements among groups of universities and a few physically secure repositories. But these are organizational rather than technical problems, and the VDS project will begin to solve them. The chief technological challenges related to survey archives concern data reliability, confidentiality, and retrieval.

The fiscal year 1999 SBE special competition announcement, Enhancing Infrastructure for the Social and Behavioral Sciences, included among its four funding thrusts "Web-based data archiving systems that enable worldwide access to linked databases and that incorporate innovative capabilities for metadata, file searching, and data confidentiality protection" (NSF, 1999a). If an efficient, comprehensive, web-based archive for social and behavioral science data existed, then new research could be solidly based on earlier studies and fresh data could be linked to the existing databases for comparison and calibration.

Equally important will be the development of technology, commonly accepted standards, and social organizations for archiving diverse types of data that go beyond traditional textual and numerical forms. Some progress already has been achieved in the area of linguistics, for example, through the Child Language Data Exchange System at Carnegie Mellon University, which has been developing a cooperative database of children's speech, methods for computer analysis of transcripts, and systems for linking transcripts to digitized audio and video.[9] Very recently, samples of recorded human speech have become available over the web from the Linguistic Data Consortium and the Phonological Atlas of North America, the latter of which keys the samples to clickable maps of the speakers' locations in the United States, vividly demonstrating the geographic variations in how Americans talk.[10] Carnegie Mellon's Informedia, one of the original six projects of the Digital Library Initiative, pioneered integration of speech, images, and text in digital video libraries.[11] A major new digital library for the spoken word will be announced shortly.

Web-Based Surveys

The widespread adoption of the Internet and the web makes it possible to administer questionnaire surveys electronically, potentially achieving much greater cost-effectiveness and permitting the integration of data from many sources. At the same time, there are significant technical challenges that must be met, especially in the areas of logistics and sampling. Recognizing the need for innovation in this and related areas, the NSF Methodology, Measurement, and Statistics Program, in collaboration with a consortium of federal statistical agencies represented by the Interagency Council on Statistical Policy and the Federal Committee on Statistical Methodology, has held a special competition on survey research methods. Included among the topic areas in the competition announcement is "secure and easy-to-use methods of collecting survey data via the Web" (NSF, 1999b).

Today's leading social scientific surveys are very expensive interview studies of national samples. For example, the GSS administers a 90-minute face-to-face interview to 1,500 American adults at a cost of about $500 per interview. However, the respondents are not a true random sample because cost considerations with respect to the interviewer's travel require that respondents be recruited in a limited number of geographic clusters, and there is no list of residents from which a random sample could be drawn. The small number of geographic areas surveyed limits scientists' ability to link GSS data to other geographically based data such as the U.S. census. Because of the high cost and the many research communities that seek time in the GSS, it is impossible to include more than a handful of questions on any particular topic. This prevents the GSS from employing much of the best methodology of measurement scale construction, which requires inclusion of a large number of items. Surveys like the GSS will be needed in future decades to chart the changing social, economic, and political conditions of the American public. But many types of social science will advance more rapidly through surveys administered over the web.

Web-based surveys can reach very large numbers of respondents at low cost. They will be geographically dispersed so that their data can be linked to the census, to local economic information, and to data from other web-based surveys. It might not be possible to hold the interviewees' interest for the full 90-minute questionnaire of the GSS, but shorter duration surveys administered to very large numbers of respondents can in aggregate include far more items, thereby permitting much finer measurement of scientifically interesting variables. The high cost of major national surveys generally has restricted the topics studied to those that especially require highly representative samples such as family structure and economic status in the Panel Study of Income Dynamics[12] and voting behavior in the American National Election Study,[13] data from both of which are now freely available over the web. A vast array of other scientific research areas, therefore, have languished for many years without the large-scale survey data that would permit knowledge to progress.
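The geographic linkage described above can be sketched as a join on a shared area key. The following is a minimal illustration in Python, not a description of any existing archive system; the postal codes, field names, and values are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical web-survey responses, each tagged with a postal code.
responses = [
    {"postal_code": "10001", "trust_score": 4},
    {"postal_code": "10001", "trust_score": 2},
    {"postal_code": "60601", "trust_score": 5},
]

# Hypothetical area-level data keyed by the same postal codes.
area_data = {
    "10001": {"median_income": 52000},
    "60601": {"median_income": 61000},
}

def link_by_area(responses, area_data, item):
    """Average one survey item per postal code and attach area-level fields."""
    by_area = defaultdict(list)
    for r in responses:
        by_area[r["postal_code"]].append(r[item])
    linked = {}
    for code, values in by_area.items():
        if code in area_data:  # keep only areas present in both sources
            linked[code] = {
                "mean_" + item: mean(values),
                "n": len(values),
                **area_data[code],
            }
    return linked

linked = link_by_area(responses, area_data, "trust_score")
```

Aggregating to the area level before joining is only one design choice; individual-level records could instead carry the area fields directly when confidentiality rules permit.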

Although conventional surveys such as the GSS do not achieve true random samples, they are more representative than the populations who currently would respond to a web-based survey. Methodologies must be developed to calibrate the respondent pools of web-based surveys to simulate as closely as possible a true random sample, for example, by inclusion of demographic and other variables that can be used in statistical weighting of responses. In addition, the objectivity of experimental methods can be employed, giving randomly selected subsets of respondents somewhat different stimulus items and measuring their alternative reactions. Indeed, computer techniques make it possible to give each of thousands of respondents a uniquely different survey compounded of a subset of a large collection of items and then to analyze items across respondents to identify commonalities and patterns that would not be visible in a traditional survey project.
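One simple form of the statistical weighting mentioned above is post-stratification, in which each respondent is weighted by the ratio of the group's population share to its share of the sample. The sketch below is an illustration under stated assumptions, not a method prescribed by the sources discussed here; the age groups, answers, and population shares are hypothetical.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Return one weight per respondent: population share / sample share."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]

def weighted_mean(values, weights):
    """Mean of values, with each value counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical pool in which web respondents skew young.
groups = ["under_40", "under_40", "under_40", "40_plus"]
answers = [1, 1, 1, 0]  # 1 = agrees with some item
shares = {"under_40": 0.5, "40_plus": 0.5}  # assumed population shares

weights = poststratification_weights(groups, shares)
unweighted = sum(answers) / len(answers)
estimate = weighted_mean(answers, weights)
```

In this toy pool the unweighted agreement rate is 0.75, while the weighted estimate is 0.5, because the single older respondent stands in for half of the assumed population. Weighting of this kind corrects only for the variables included in it.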

One drawback of existing national samples is that only small fractions of the population often have the personal characteristics, life experiences, and/or levels of education that would make them appropriate respondents for many types of surveys. For example, research supported by SBE's Science Resources Studies division indicates that only about 14% of Americans are attentive to issues concerning science and technology (National Science Board, 1998, p. 7). A general-purpose web-based survey system could recruit large numbers of respondents, give them a few preliminary questions to categorize them, and then allocate them in real time to the specific studies for which they are the most appropriate respondents. The issue of generalizability of results can be addressed not by investing in prohibitively expensive random samples but rather by replicating results across a variety of subsets of respondents who are likely to have different patterns of response to the key items. When significant differences in results are found across subpopulations, further research can be done to discover the factors responsible for the variations and to explain them theoretically.
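The screen-then-allocate idea in this paragraph can be sketched as follows. This is a hedged illustration only: the study names, screening variable, and the balancing rule (send each respondent to the eligible study with the fewest respondents so far) are assumptions, not features of any system described in the sources.

```python
def eligible_studies(profile, studies):
    """Return names of studies whose screening criteria the profile satisfies."""
    return [
        s["name"]
        for s in studies
        if all(profile.get(k) == v for k, v in s["criteria"].items())
    ]

def allocate(profile, studies, counts):
    """Assign the respondent to the eligible study with the fewest respondents."""
    names = eligible_studies(profile, studies)
    if not names:
        return None
    chosen = min(names, key=lambda name: counts.get(name, 0))
    counts[chosen] = counts.get(chosen, 0) + 1
    return chosen

# Hypothetical studies: one needs science-attentive respondents, one takes anyone.
studies = [
    {"name": "science_attitudes", "criteria": {"attentive_to_science": True}},
    {"name": "general_values", "criteria": {}},
]
counts = {}
first = allocate({"attentive_to_science": True}, studies, counts)
second = allocate({"attentive_to_science": False}, studies, counts)
third = allocate({"attentive_to_science": True}, studies, counts)
```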

Some researchers are conducting surveys both on the web and by other means, thereby making it possible to compare results across methodologies. For example, Trudy Ann Cameron at the University of California, Los Angeles, is conducting a web-based survey of beliefs about future climate change, working with faculty collaborators at several universities who will recruit student respondents, alongside a conventional mailed survey of the general population.[14] A team headed by David Weimer at the University of Rochester is evaluating the potential of web-based surveys in comparison with those done by telephone in a study of people's willingness to pay to reduce emissions of greenhouse gases.[15]

Probably the most extensive web-based survey demonstration project yet completed is James Witte's Survey 2000, administered on the web in November 1998 by the National Geographic Society. More than 50,000 adults and 16,000 children completed one or another version of this complex survey, which focused on geographic migration, regional culture, social environment, Internet use, interests, and attitudes. Most respondents were residents of the United States or Canada, and the geographic database includes their detailed postal codes, but at least 100 respondents came from 1 of 33 other nations.

The computers administering Survey 2000 followed extremely complex instructions that used respondents' answers to early questions to determine which later questions to ask, for example, obtaining a mobility history throughout the person's life and collecting data about preferences for foods and musical styles that belonged to the respondent's region of birth or current region of residence. In addition, one or another of four major topical modules was included at random. The ability of surveys to employ audiovisual stimuli is demonstrated by the fact that many respondents had the opportunity to respond to questions about clips of music that were transmitted. The survey was specifically designed to be linked to other geographically based data, and the National Geographic Society plans to place the data on the web in a system that would make it valuable for students and researchers everywhere.
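The kind of answer-dependent branching described above can be sketched in a few lines. This is not the Survey 2000 instrument; the question identifiers, region names, and module names are hypothetical, and only the general pattern (branch on earlier answers, tailor items to the respondent's regions, add one randomly chosen module out of four) follows the description.

```python
import random

MODULES = ["module_a", "module_b", "module_c", "module_d"]  # hypothetical names

def next_questions(answers, rng):
    """Choose follow-up questions from earlier answers (hypothetical rules)."""
    questions = []
    # Branch: ask about the move only if the respondent changed regions.
    if answers["birth_region"] != answers["current_region"]:
        questions.append("year_of_move")
    # Tailor: ask about tastes tied to each region in the respondent's history.
    for region in sorted({answers["birth_region"], answers["current_region"]}):
        questions.append("food_preference_" + region)
    # Randomize: include one of the four topical modules.
    questions.append(rng.choice(MODULES))
    return questions

mover = next_questions(
    {"birth_region": "south", "current_region": "midwest"}, random.Random(0)
)
stayer = next_questions(
    {"birth_region": "west", "current_region": "west"}, random.Random(0)
)
```

Passing the random number generator in as an argument keeps the module assignment reproducible, which matters when the assignment itself is part of the experimental design.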

Computer administration of questionnaires permits the researcher to collect information of many types that cannot be obtained through paper surveys or telephone interviews. For example, questions can incorporate graphics, photographs, sounds, and moving images, and the computer can measure nonverbal behavior of the respondent such as the time it takes to react to various stimuli. This last point is the fundamental principle of the Implicit Association Test, developed by Anthony Greenwald and Mahzarin Banaji.[16] Twin web sites demonstrating this methodology, at the University of Washington and Yale University, have received more than 200,000 "visits." The researchers note that people often are unaware of important components of their own cognitive processes, so these phenomena cannot be studied effectively through conventional self-report questionnaire items. Web-based demonstration tests using the new implicit association technology have explored its value for research on social stereotypes of age, race, and gender; on attitudes toward oneself; and on preferences for academic subjects such as mathematics.

A form of research that is methodologically related to surveys is content analysis of material that already is flowing freely over the Internet or that can be obtained for a fee through various online data services. Political scientist Joshua Goldstein has employed machine coding of news stories carried by the Reuters news service to study the actions of national leaders in regional conflicts such as that in Bosnia.[17] Philip Schrodt and Deborah Gerner are doing similar research using machine coding of Reuters reports of political events in the Middle East.[18] Sociologist J. Craig Jenkins has used automatic event coding to analyze text from Reuters.[19]
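To make the idea of machine coding concrete, the sketch below turns a sentence into a (source, event, target) triple using a small pattern dictionary. It is a deliberately simplified illustration, not the coding system used by any of the researchers named above; the event categories, verb patterns, and actor names are hypothetical.

```python
import re

# Hypothetical dictionary mapping verb phrases to event categories.
EVENT_PATTERNS = {
    "attack": re.compile(r"\b(attack(?:ed|s)?|shell(?:ed|s)?|bomb(?:ed|s)?)\b", re.I),
    "negotiate": re.compile(r"\b(negotiat(?:e|ed|es)|met with|talks with)\b", re.I),
}

def code_event(sentence, actors):
    """Return (source, event, target) for the first matching pattern, else None."""
    for event, pattern in EVENT_PATTERNS.items():
        match = pattern.search(sentence)
        if not match:
            continue
        # Treat an actor named before the verb as source, after it as target.
        before = sentence[:match.start()].lower()
        after = sentence[match.end():].lower()
        source = next((a for a in actors if a.lower() in before), None)
        target = next((a for a in actors if a.lower() in after), None)
        if source and target:
            return (source, event, target)
    return None

actors = ["Country A", "Country B"]
coded = code_event(
    "Forces from Country A shelled positions held by Country B on Tuesday.", actors
)
uncoded = code_event("Markets were quiet.", actors)
```

A rule this crude will misread passive constructions and sentences with several verbs, which is one reason calibration against human coding is needed before such data are used.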

All of the new Internet-based research methodologies require calibration studies to determine their biases and to find methods to compensate for them. In wide-ranging research on political demonstrations in the United States, Germany, and Belarus, John McCarthy, Clark McPhail, and Pamela Oliver examined the factors that determined the extent and type of media coverage devoted to such events.[20] Once appropriate validation and calibration procedures have been developed, information from news services and other sources concerning events can be attached to the geographic location where they took place for combination with data from opinion surveys, economic indicators, and even broadly based experiments on people's actual behavior.

Web-Based Interaction Collaboratories

The importance of rigorous experimental methods has long been recognized in several of the social sciences, but very recently two developments have given this type of research a fresh urgency. First, the development of new technologies has rendered large-scale experimental research far more cost-effective and feasible for a much wider range of scientific questions than in the past. Second, throughout the social sciences, there is a widespread recognition that it is necessary to go beyond the speculative or exploratory work of the past to achieve a high degree of reliability in research findings, which in turn requires improved rigor and greater attention to replication of results by multiple teams and methodologies.

Many of the most challenging social scientific questions involve interactions among larger numbers of people than can be accommodated in conventional laboratories including markets, election systems, and social networks. In addition, the educational value of social science laboratories is limited by the fact that few universities have them. Both the replication and extension of experimental results are hampered by the small number of research teams. Now, it is possible to build Internet-based distributed laboratories, which require new software and organizational frameworks. By these means, experiments now can be scaled up to include hundreds or even thousands of participants. Future "Netlabs" will cross national and cultural boundaries, bringing new population samples into the laboratory. The time duration of many experiments can be greatly expanded to cover evolutionary processes never before studied. Perhaps most important, laboratory experimentation now can become part of the routine education of undergraduates in the social sciences. This Net-based research will have the added benefit of allowing scientists to compare how people interact electronically to how they behave in conventional laboratory settings, thus examining the distinctive qualities of electronic communication through highly rigorous research.

Researchers in several disciplines have pioneered the construction of modest experimental laboratories to study social interaction that are based on LANs of computers. For example, teams of sociologists at the University of South Carolina (led by David Willer) and the University of Iowa (led by Barry Markovsky) have collaborated on several projects developing and testing mathematically based theories of how the structure of social networks confers power on the people occupying certain locations in those networks. Each university has a very modest laboratory for running experiments on social interaction among small numbers of human participants, but the current technology does not permit them to link the two laboratories in real time or to open up the experimental systems to include large numbers of research participants arranged in realistically complex social networks. Over a decade, SBE invested $1,052,033 in a dozen peer-reviewed awards[21] to conduct this scientifically excellent pioneering research in this pair of laboratories. Then in 1999, an award of $1,199,215 was made under the Digital Library Initiative to support development of a prototype Internet-based laboratory for social interaction research.[22]

Another extremely important area for research is the functioning of various types of economic markets. SBE awards to economist Charles R. Plott, of the California Institute of Technology, totaling $1,204,974 have supported development of a laboratory capable of exploring the properties of experimental markets that are larger and operate over a longer time scale than those studied previously.[23] This work is establishing the knowledge base required to scale such experimental research up to much larger and more complex systems that are far more realistic models of securities and commodities markets in the real world.

Several other researchers also are exploring this promising approach. David Lucking-Reiley has been comparing the dynamics of Internet-based auctions employing different formats and contrasting them to results from laboratory studies of face-to-face auctions.[24] Robert Mauro is developing an Internet-Based Decision Research System to carry out experiments at multiple remote locations.[25] Daniel McFadden is experimenting with the effect of anchoring and focal points in the response choices offered in Internet-based surveys, and with Paul Ruud he also has been pioneering the online distribution of social science software.[26]

To determine the scientific needs and opportunities in this rapidly developing field, SBE sponsored a workshop in October 1997 titled "NetLab" (see Appendix). A total of 20 social and behavioral scientists who are experts in this area combined their knowledge and produced a comprehensive report.[27] They agreed that the new computing and communications technologies permit experiments to be scaled up to include as many as 1,000 or more human participants. For example, courses at several colleges could link the laboratory sessions of their social psychology courses so that hundreds of students were engaged in the same experiment on all their campuses simultaneously. Other experiments could be run with hundreds of ordinary citizens over the web, interacting simultaneously in a realistic economic market or a simulated political election campaign. Corporations frequently invite researchers to conduct studies in management science or the sociology of organizations, often involving many hundreds of employees, and they often might find it more efficient to set aside a particular block of time when the selected employees participate simultaneously over the LAN using their office computers.

The NetLab workshop gave equal importance to the capability of web-based laboratories to reach across national boundaries, drawing people from a variety of cultures and populations into the research. A severe limitation of much conventional social science laboratory research is that the human participants often are highly homogeneous in their characteristics - proverbially middle-class college sophomores - so that the generalizability of the results to other groups is in severe doubt. In addition, because so few laboratories and teams currently exist, important research results seldom are verified in fresh studies, and Netlabs would greatly facilitate replication. Workshop participants also envisioned experiments of much longer duration, examining dynamic processes of change, as the Internet made it much more convenient to hold multiple experimental sessions with the same research participants. The Netlab has great educational potential. Advanced undergraduate and graduate students would have the crucial educational experience of setting up the local portion of the collaboratory, recruiting research participants, and analyzing the prompt results that can be available immediately after conclusion of the data collection sessions.

The NetLab workshop report also recognized that several major challenges must be met to bring these visions to reality. New software must be written that will run reliably on various platforms at multiple locations. Workshop participants enthusiastically advocated development of a universal modular software system that would facilitate running experiments with a wide range of designs, vastly reducing the effort required to prepare for a new experiment. There was a consensus that "large-scale centers and 'collaboratories' are necessary to help push the development of these new tools and approaches." As the physical hubs of the distributed Netlab, these centers would require special hardware and also would take on training and technical support functions. The report concluded that "social science Netlabs will have a massive impact on knowledge development and instruction."

One of the goals of SBE's special infrastructure competition is to "create Web-based collaboratories to enable real-time controlled experimentation, to share the use of expensive experimental equipment, and/or to share widely the process and results of research in progress" (NSF, 1999a). Whether the fiscal year 1999 or 2000 competitions support Netlab development will depend on the outcome of the regular NSF peer review process, but by listing experimentation collaboratories as a priority, SBE has recognized the importance of these facilities across numerous fields.

Social Science Geographic Information Systems

During recent years, there has been tremendous progress in computerized display and analysis of geographic data. Largely in the private sector, geographic information systems (GIS) have become a major industry, and rigorous methods of geographic information analysis (GIA) have developed rapidly in academia. These two parallel lines of development need to be combined, and GIS needs to transcend its emphasis on physical geography by including a greater number and variety of social variables and techniques of analysis. Perhaps the most fruitful realm for accomplishing this is in the linkage of geographically based data from multiple sources, permitting analysis at both the level of the geographic unit (community, city, state, nation) and the level of the individual human being.

It is worth distinguishing two very general social science conceptualizations of the linkage between geographically based data and data about individual human beings. First, geographic data can describe the environment in which the individual lives. For example, the GSS has long measured White people's attitudes toward African Americans, but respondents live in very different racial environments, some coming into constant contact with minorities and some living in parts of the country where there simply are very few minorities with whom to come into contact. Logically, the racial mix in the community should significantly influence racial attitudes, so it was an important step when researchers such as Marylee C. Taylor began adding to the GSS data set information from the census about the percentage of the local population who were African Americans.[28] Similarly, census data about the occupational structure of the local economy can illuminate the career options open to GSS respondents.
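The first kind of linkage, in which area-level data describe each respondent's environment, can be sketched in a few lines. This is only an illustration under stated assumptions: the area codes, the variable names (`pct_african_american`, `attitude`), and every number are invented for the example and are not drawn from the GSS or the census.

```python
# Hypothetical sketch: attach area-level census context to survey respondents.
# All identifiers and values below are invented for illustration.

# Area-level records keyed by a geographic code (an assumed identifier).
census_context = {
    "A": {"pct_african_american": 2.0},
    "B": {"pct_african_american": 35.0},
}

# Individual-level survey records carrying the same geographic code.
respondents = [
    {"id": 1, "area": "A", "attitude": 3},
    {"id": 2, "area": "B", "attitude": 5},
    {"id": 3, "area": "C", "attitude": 4},  # no census record for area "C"
]


def attach_context(people, context):
    """Return new respondent records with area-level variables merged in.

    Respondents whose area has no contextual record receive None for each
    contextual variable, so missing context stays visible instead of the
    respondent being silently dropped.
    """
    variables = sorted({name for record in context.values() for name in record})
    linked = []
    for person in people:
        area_record = context.get(person["area"], {})
        merged = dict(person)  # copy, so the original survey record is untouched
        for name in variables:
            merged[name] = area_record.get(name)
        linked.append(merged)
    return linked


linked = attach_context(respondents, census_context)
```

Keeping unmatched respondents with explicit missing values, rather than discarding them, is a design choice made here so that the analyst can see how much of the sample lacks contextual data.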

The second way in which geographic data can be combined with individual data is to treat the geographic data as an indirect but useful source of information about the individuals. For example, many studies have found that crime rates (notably larceny) are higher in communities with high rates of residential moving. It would be wrong simply to conclude that all thieves are transients, but it is reasonable to infer that they tend to have some characteristics related to transience such as weak social bonds.

A classical challenge in this type of research is called the ecological fallacy, which is the mistake of assuming that individuals have the typical characteristics for their areas. Theorists and methodologists have developed ways of meeting this challenge. First, we can recognize that all scientific findings are somewhat tentative, and we always must be looking for alternative explanations and fresh ways of testing the ideas empirically. Another approach is to recognize that the ecological fallacy usually is just a special case of the problem of spuriousness, the distorting effect of an unmeasured variable on the correlation between two other variables (Bainbridge, 1992, pp. 385-386, 452-453). The solution, then, is to make sure that we measure as many of the relevant variables as possible, which in this context we can do by combining several geographically based data sets. A host of multivariate statistical techniques already is available to help analyze geographically based data accurately, and new methods undoubtedly will be created if we invest seriously in developing this area of research.
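The point about spuriousness can be illustrated with a small simulation. The sketch below assumes a set of hypothetical areas in which an unmeasured variable `z` drives both `x` and `y`, which have no direct connection to each other. It then compares the raw correlation of `x` and `y` with the first-order partial correlation that controls for `z`. The data are simulated, the variable names are placeholders, and partial correlation is used here only as one familiar multivariate technique, not as the method of any study cited above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(xs, ys, zs):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated areas (assumption): z influences both x and y, but x and y
# have no direct effect on each other.
rng = random.Random(0)
n_areas = 2000
z = [rng.gauss(0, 1) for _ in range(n_areas)]
x = [value + rng.gauss(0, 1) for value in z]
y = [value + rng.gauss(0, 1) for value in z]

raw = pearson(x, y)                # substantial, because z is ignored
controlled = partial_corr(x, y, z)  # close to zero once z is measured
```

When `z` is left out, `x` and `y` appear related; once `z` is measured and controlled, the apparent relationship largely disappears. This is the practical argument for combining several geographically based data sets.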

In June 1996, representatives of the 29 research institutions that had formed the University Consortium for Geographic Information Science met to set research priorities in their field. Among these priorities were the societal implications of GIS, and the published report of this conference notes the great significance of the new technology for social science disciplines as varied as demography, econometrics, and political science. While listing unanticipated impacts of GIS on society, the report stated, "GIS can be used to link together digital street maps and telephone directories so that the telephone number of a house can be found by pointing to the image of the house on a computer screen. Marketing campaigns can be targeted to the imputed socioeconomic status of each household" (University Consortium for Geographic Information Science, 1996, p. 117). Whatever social benefit or risk these technical possibilities will raise when used in commercial applications, GIS offers entirely new possibilities to the social sciences, especially when used to combine detailed data from different sources.

The European Science Foundation sponsored a series of 12 workshops on different aspects of GIS, culminating in a conference held near Strasbourg, France, in September 1997. "Geographic Information Research at the Millennium" (see Appendix) stressed that interoperability across different systems and types of data was a very important new area for research. The published report noted, "In the social sciences in general, and in quantitative geography and regional economics in particular, a wide range of spatial models has been developed in the past decades for the purposes of describing, analyzing, forecasting, and policy appraisal of economic developments within a set of localities or regions." However, these methods have not been incorporated into GIS software, and across the social sciences many powerful statistical techniques exist but are not available in a practical form to users of GIS. Also crucial is the development of new techniques of statistical analysis and data modeling that will take advantage of the new computing power.

SBE has sponsored workshops on the application of geographic analysis in two substantive social science areas: (a) democratization and market transition and (b) human capital. These areas earlier had been identified by other workshops as very high priorities for the social sciences. The first of these topics concerns the huge changes taking place in state socialist societies such as the former Soviet Union and China as well as in other countries undergoing dramatic upheavals. As the original democratization workshop report noted, "The political, social, and economic transformations currently under way in the world provide a veritable social scientific laboratory for research on the dynamics and probable consequences of these global transformations" (NSF, 1993-1994).

The NSF-sponsored workshop, "Geographic Approaches to Democratization" (see Appendix), identified many needs for large, international databases in three general domains: (a) territoriality, (b) spatial structures and flows, and (c) human-environment relations. Examples of territoriality include changing national boundaries and political systems. One of the most striking spatial flows during recent years has been international migration. Vastly increased information, and advanced systems for analyzing it, are crucially needed on the complex pattern of natural resource scarcity and abundance across regions, nations, and communities. Many crucial research topics cross all three domains and require greatly improved information systems, for example, the interplay among political transformation, migration, and natural resources in relation to the rate of economic development.

Human capital sometimes is described as the skills and knowledge that human beings learn that make them valuable in the workplace, but the term also includes the value-enhancing aspects of individuals' social network and of the surrounding moral order. The original human capital workshop identified research needs in six areas that are of great substantive importance to the nation: (a) employing a productive workforce, (b) educating for the future, (c) fostering successful families, (d) building strong neighborhoods, (e) reducing disadvantage in a diverse society, and (f) overcoming poverty and deprivation (NSF, 1994).

The report of the 1995 workshop on "Geographic Information Analysis and Human Capital Research" (see Appendix) noted, "There is virtually no limit to the volume and variety of data that can be linked using space as the reference grid." At the same time, this workshop's participants agreed with those of the conference held by the European Science Foundation that the "coupling of GIS with more powerful statistical routines characteristic of GIA is yet in its infancy" (p. 7). Although noting methodological challenges such as the ecological fallacy, the report confidently asserted, "Geographic space can be utilized in social science research as a proxy for association or relatedness among social phenomena" (p. 8). To illustrate the range of human capital issues that can be studied geographically, the workshop examined scientific research that had been performed on three specific topics: (a) the relationship between human capital and the distribution of violent crime; (b) the role that neighborhoods and social networks play in creating and sustaining human capital; and (c) the interrelations of migration, demographic change, and human capital.
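The idea of using space as the reference grid can be made concrete with a simple sketch. It assumes two hypothetical point data sets (named `incidents` and `schools` purely for illustration), invented coordinates, and an arbitrary cell size of one degree. Each point is assigned to a grid cell, and the two layers are then linked by cell. Real GIS software would use proper map projections and administrative boundaries rather than this crude grid.

```python
import math
from collections import Counter

CELL_DEGREES = 1.0  # assumed cell size; chosen only for illustration


def grid_cell(lat, lon, size=CELL_DEGREES):
    """Map a coordinate pair to the integer index of its grid cell."""
    return (math.floor(lat / size), math.floor(lon / size))


def counts_by_cell(points):
    """Count how many points of one layer fall in each grid cell."""
    return Counter(grid_cell(lat, lon) for lat, lon in points)


def link_by_cell(**layers):
    """Join any number of point layers on their shared grid cells.

    Returns {cell: {layer_name: count}}, with 0 where a layer has no
    points in that cell.
    """
    counted = {name: counts_by_cell(points) for name, points in layers.items()}
    cells = set().union(*counted.values())
    return {
        cell: {name: counted[name].get(cell, 0) for name in layers}
        for cell in sorted(cells)
    }


# Invented coordinates for two hypothetical data sources.
incident_points = [(10.1, 20.2), (10.3, 20.4), (12.6, 22.7)]
school_points = [(10.8, 20.9), (12.1, 22.2), (15.5, 25.5)]

linked = link_by_cell(incidents=incident_points, schools=school_points)
```

Because the join key is location alone, any further data set with coordinates could be added as another layer without changing the linking logic.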

A scientific workshop examining geographic information science and geospatial activities at the NSF, "Geographic Information Science: Critical Issues in an Emerging Cross-Disciplinary Research Domain" (see Appendix), reported,

Scenarios for geographic information use in the year 2010 suggest great potential to extend the capabilities of scientific researchers, decision makers, and the public. This potential, however, will only be realized if there are substantial advances in [GIS], enhancing knowledge of geographic concepts and their computational implementations.... Information technology, communications infrastructure, microelectronics, and related technologies could enable unprecedented opportunities for discovery and new ways to do research.

The NSF has supported much cutting-edge research and development on GIS and related educational curriculum development. Notably, the NSF Geography and Regional Science Program has supported the National Center for Geographic Information and Analysis (NCGIA), which is a model of a multisite consortium, linking researchers and facilities at the University of California, Santa Barbara; the State University of New York at Buffalo; and the University of Maine at Orono.[29] Working parallel to the NCGIA, the Alexandria Digital Library has pioneered work with spatially indexed information including collections of maps and digital images.[30] Most of this work, however, has focused on physical geography, and much work will be required to extend the techniques to social geography.

Data Security and Confidentiality

Among the chief challenges that must be met by any highly public system to distribute social science data is the strict necessity of protecting the rights of the human beings whom the data concern. According to the NSF (1995) Grant Policy Manual, "The grantee is responsible for the protection of the rights and welfare of human subjects involved in activities supported by NSF" (sec. 711.1). This responsibility does not end when the grantee transfers data from an NSF award to an archive. The chief safeguard traditionally has been review of proposed research projects by university-based human subjects committees, operating in accordance with federal regulations. These regulations specifically exempt from full human subjects review research involving the use of educational tests (cognitive, diagnostic, aptitude, achievement), survey procedures, interview procedures, or observation of public behavior, unless (i) information obtained is recorded in such a manner that human subjects can be identified, directly or through identifiers linked to the subjects; and (ii) any disclosure of the human subjects' responses outside the research could reasonably place the subjects at risk of criminal or civil liability or be damaging to the subjects' financial standing, employability, or reputation.[31]

Mindful of these principles, the principal investigators on major NSF-supported surveys (e.g., GSS, Panel Study of Income Dynamics, National Elections Study) have been very cautious about giving researchers access to variables in their data sets that might be used to identify individual respondents. The most sensitive variable is the precise geographic location of the respondent's residence, but this also is crucial information for linking individual data to data about the surrounding community. Traditionally, there have been two solutions to this problem: (a) strict procedures for providing selected researchers with limited access to "geocodes" or other sensitive data under controlled conditions and with clear legally enforced restrictions and (b) statistically aggregating the data or handling them in some other manner that prevents researchers from deducing the identities of individuals.

In partnership with the Bureau of the Census, the NSF has explored methods for expanding social science access to sensitive census data under controlled conditions by establishing the Boston Research Data Center.[32] An example of the type of scientifically meritorious research of great policy significance conducted at this center is a project by J. Vernon Henderson examining the interrelationship of air quality regulations, factory productivity, and corporate decisions about where to locate factories.[33] Based on the success of the Boston Research Data Center, the NSF has supported the creation of similar centers in California (University of California, Los Angeles, and University of California, Berkeley) and another site in a different geographic region to be announced soon.[34] In addition, Carnegie Mellon University has set up a center used extensively by the NSF-supported National Consortium on Violence Research.[35]

An alternate approach is to forbid researchers from accessing the raw data directly but to provide them with the aggregate statistical results they actually need to do their science. For example, the NSF-supported Luxembourg Income Study (LIS) is an archive possessing the original data from studies of individual and family income in many nations.[36] A researcher might ask, for example, for a table of the correlations between level of education and income in all the nations. Personnel at the LIS would do the necessary computer runs and send only the summary results to the researcher. If the LIS personnel are alert and follow well-designed protocols, then this will prevent anyone from identifying particular individuals described in any of the data sets.
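A minimal sketch of this kind of mediated access follows. It is not a description of the LIS's actual procedures. The function name `correlation_table`, the field names, the minimum-record threshold, and the data are all assumptions made for the example. The researcher's request is answered with one correlation per nation, and a nation with too few records yields no statistic at all.

```python
import math

MIN_RECORDS = 5  # assumed protocol threshold, not an actual archive rule


def pearson(xs, ys):
    """Pearson correlation; returns None if either variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def correlation_table(microdata, min_records=MIN_RECORDS):
    """Return only summary results: one education-income correlation per nation.

    The raw records never leave this function. Nations with fewer than
    min_records respondents are reported as None (suppressed).
    """
    table = {}
    for nation, rows in microdata.items():
        if len(rows) < min_records:
            table[nation] = None
            continue
        education = [row["education_years"] for row in rows]
        income = [row["income"] for row in rows]
        r = pearson(education, income)
        table[nation] = None if r is None else round(r, 2)
    return table


# Invented microdata held by the archive (hypothetical nations "X" and "Y").
archive = {
    "X": [
        {"education_years": e, "income": i}
        for e, i in [(8, 20), (10, 24), (12, 30), (14, 33), (16, 41), (18, 45)]
    ],
    "Y": [{"education_years": 12, "income": 28}, {"education_years": 16, "income": 40}],
}

table = correlation_table(archive)
```

Here nation "Y" has only two records, so its entry is suppressed rather than reported.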

However, a determined and unscrupulous person hypothetically could submit a series of apparently innocent analysis requests, from which individual data could be reconstructed, using various statistical inference methods. This has been called the problem of "deductive disclosure" or "inferential security" (Keller-McNulty & Unger, 1993). To achieve the maximum potential of integrated Internet-based social science databases, research must be carried out on how to prevent inferential attacks on database integrity, and technology has to be developed to permit maximum use of the data by students and the general public while preserving confidentiality of individual identity. Integrating separate databases multiplies both the scientific value and the risk that a determined attacker can compromise security by combining information from many variables.
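One elementary form of deductive disclosure, often called a differencing attack, can be sketched as follows. The records, field names, and the minimum-cell rule are invented for illustration. The sketch assumes an archive that answers only aggregate queries and refuses any query covering fewer than five records. Two individually acceptable requests nevertheless reveal one person's income.

```python
MIN_CELL = 5  # assumed rule: refuse aggregates over fewer than 5 records

# Invented confidential microdata held by a hypothetical archive.
records = [
    {"region": "N", "age": 34, "income": 31000},
    {"region": "N", "age": 45, "income": 42000},
    {"region": "N", "age": 52, "income": 55000},
    {"region": "N", "age": 29, "income": 28000},
    {"region": "N", "age": 67, "income": 90000},  # the target of the attack
    {"region": "N", "age": 41, "income": 47000},
    {"region": "S", "age": 38, "income": 39000},
]


def total_income(condition):
    """Answer an aggregate query, enforcing the minimum cell size."""
    selected = [row["income"] for row in records if condition(row)]
    if len(selected) < MIN_CELL:
        raise ValueError("query refused: too few records")
    return sum(selected)


# The attacker knows only that the target is the sole 67-year-old in region N.
everyone_in_n = total_income(lambda row: row["region"] == "N")  # 6 records
all_but_target = total_income(
    lambda row: row["region"] == "N" and row["age"] != 67       # 5 records
)

# Each query passed the safeguard, yet their difference is one person's income.
inferred_income = everyone_in_n - all_but_target
```

The example shows why a rule applied to each request in isolation is not sufficient. Protection has to take account of what can be inferred from combinations of requests.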

Giving responsible social scientists greater access to sensitive government data can provide policymakers with much better information on which to base their decisions. For example, Lingxin Hao at Johns Hopkins University has obtained access to confidential data at the Census Bureau's site in the Washington, D.C., area to study the factors that determine the extent to which immigrants to the United States make use of public assistance when they run into economic difficulty.[37] To do this research, she must link data from the Survey of Income and Program Participation with the respondents' residential addresses to allow the addition of much information from the 1990 census about the social context in each individual's immediate geographic area.

Any security breach in the use of such sensitive data not only might harm the individuals described in the data but also could discredit the archive involved and possibly even the whole system of providing information confidentially to researchers. Therefore, early versions of integrated Internet-based social science databases, in which the unit of analysis is the individual respondent, might withhold some key variables except to approved researchers working at high-security sites. Some initial research might emphasize variables that respondents do not mind making public, for example, general attitudes and preferences. At the same time, sensitive information can be reported carefully in aggregated form in geographically based data sets, for example, in terms of rates per million population in states or metropolitan areas. Over time, further development of secure sites and of inferential security technology will expand the legitimate possibilities for data availability while fulfilling the duty of preserving individual confidentiality.
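Reporting sensitive information as rates per million population amounts to a simple aggregation, sketched below. The area labels, event records, and population figures are invented. Only the area of each event is used, and the output contains no individual-level rows.

```python
from collections import Counter


def rates_per_million(event_areas, populations):
    """Aggregate individual events into area-level rates per million residents.

    event_areas: one area label per sensitive event (hypothetical data).
    populations: resident population of each area.
    """
    counts = Counter(event_areas)
    return {
        area: counts.get(area, 0) * 1_000_000 / population
        for area, population in populations.items()
    }


# Invented inputs: four events across three hypothetical areas.
event_areas = ["A", "A", "B", "A"]
populations = {"A": 2_000_000, "B": 500_000, "C": 1_000_000}

rates = rates_per_million(event_areas, populations)
# Area A: 3 events among 2,000,000 residents gives 1.5 per million.
```

In very small areas even a rate can point to identifiable individuals, so a real system presumably would combine aggregation with further safeguards of the kind discussed above.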

Conclusion

Research and development projects already supported by SBE and the Digital Library Initiative have established a foundation of practical experience and scientific expertise on which to build the integrated social science of the future. Although the NSF has long funded online data archives, it has only begun to provide support for web-based questionnaire surveys despite the great potential in this area. Although support has been provided for web-based experimentation, the range of scientific fields that could advance through development in this area is so great, and laboratory research designs are so varied, that additional projects are needed to run parallel to the few that already are funded. A major methodological gap is the fact that GIS have lagged behind in their application to social geography and in the development of software to carry out social science analysis. Centers have been established to develop procedures for sharing sensitive data with small numbers of researchers under secure conditions, but research has not been carried out on statistical and computational methods that would permit the most sensitive variables to be included in public use data systems. Practically nothing has been done as yet to integrate across the methodologies.

The International Network for Integrated Social Science is both technically feasible and scientifically necessary. Through it, social science can take its rightful place beside the natural sciences as a rigorous, objective discipline that is just as scientific as biology, physics, and chemistry. In the information society of the 21st century, the data contained in this international network of archives will be of great value to education, commerce, government policy, and the general enlightenment of the citizenry.

Appendix: Workshops and Conferences

Connecting and Collaborating
(implications of new communication technologies for international scientific collaboration)
San Diego Supercomputer Center
June 22-24, 1995
http://www.nsf.gov/sbe/sber/sociol/works2.htm

Geographic Approaches to Democratization
National Science Foundation
December 5, 1994

Geographic Information Analysis and Human Capital Research
Boulder, Colorado
July 10-12, 1995

Geographic Information Research at the Millennium
European Science Foundation Social Science Programme
Le Bischenberg, France
September 13-17, 1997
http://www.esf.org/

Geographic Information Science: Critical Issues in an Emerging Cross-Disciplinary Research Domain
National Science Foundation
January 14-15, 1999
http://www.geog.buffalo.edu/ncgia/gisciencereport.html

NetLab
(Internet-based collaboratories for social interaction research)
National Science Foundation
October 30-31, 1997
http://www.uiowa.edu/~grpproc/netlab.htm

Research Priorities for Geographic Information Science
University Consortium for Geographic Information Science
Cartography and Geographic Information Systems, Vol. 23, 1996, pp. 115-127
June 1996
http://www.ucgis.org

Notes

1. NSF, Social Behavioral and Economic Research Archiving Policy, http://www.nsf.gov/sbe/sber/common/archive.htm.

2. http://www.icpsr.umich.edu.

3. NSF Awards 8618467, 9122133, 9122462, 9318653, 9511023, and 9617727. Information about these and awards cited in subsequent notes can be accessed at the NSF web site, http://www.nsf.gov, in addition to any other web addresses provided.

4. NSF Awards 9118299, 9210903, and 9422805.

5. NSF Award 9512010.

6. NSF Award 9527246, http://www.cpc.unc.edu/addhealth/.

7. NSF Award 9223326.

8. NSF Awards 9422556, 9422785, and 9800623.

9. NSF Award 9808974.

10. NSF Awards 9528587, 9528984, 9111637, 9222458, and 9811487.

11. NSF Award 9411299.

12. NSF Awards 9022891 and 9515005, http://www.isr.umich.edu/src/psid/.

13. NSF Awards 8808361, 9317631, and 9707741, http://www.icpsr.umich.edu/nes/anesintro.html.

14. NSF Award 9818875.

15. NSF Award 9818875.

16. NSF Awards 9120987, 9205890, 9422241, 9422242, 9709924, and 9710172.

17. NSF Award 9617157.

18. NSF Award 9410023.

19. NSF Award 9710958.

20. NSF Awards 9122691, 9122732, 9320488, 9320704, 9511748, 9523439, 9601404, and 9632641.

21. NSF Awards 9871019, 9811323, 9515434, 9515364, 9423231, 9422974, 9223799, 9223688, 9109528, 9022935, 9010888, and 8808289.

22. NSF Award 9817518.

23. NSF Awards 9730176 and 9512394.

24. NSF Award 9811273.

25. NSF Award 9730581.

26. NSF Awards 9409005 and 9601149.

27. NetLab workshop report, http://www.uiowa.edu/~grpproc/netlab.htm.

28. NSF Award 9515187.

29. NSF Awards 8810917, 9321119, 9600465, 9602348, 9850595, and 9851550.

30. NSF Awards 9411330 and 9601954.

31. Federal Policy for the Protection of Human Subjects, 45 CFR, sec. 690.101 (1993).

32. NSF Awards 9311572 and 9610331.

33. NSF Award 9422440.

34. NSF Awards 9812173 and 9812174.

35. NSF Award 9513040.

36. NSF Awards 8801640, 9123675, 9196010, 9321507, 9511521, and 9729762.

37. NSF Award 9819209.

References

Bainbridge, W. S. (1992). Social research methods and statistics: A computer-assisted introduction. Belmont, CA: Wadsworth.

Homans, G. C. (1967). The nature of social science. New York: Harcourt, Brace, & World.

Keller-McNulty, S., & Unger, E. A. (1993). Database systems: Inferential security. Journal of Official Statistics, 9, 475-499.

National Science Board. (1998). Science and engineering indicators 1998. Washington, DC: Government Printing Office.

National Science Foundation. (1993-1994). Democratization: A strategic plan for global research on the transformation and consolidation of democracies. Arlington, VA: Author. [Online]. Available: http://www.nsf.gov/sbe/sber/sociol/works4.htm

National Science Foundation. (1994). Investing in human resources: A strategic plan for the human capital initiative. Arlington, VA: Author. [Online]. Available: http://www.nsf.gov/sbe/sber/sociol/works1.htm

National Science Foundation. (1995). Grant policy manual. Arlington, VA: Author.

National Science Foundation. (1999a). Enhancing infrastructure for the social and behavioral sciences (NSF Publication 99-32). Arlington, VA: Author. [Online]. Available: http://www.nsf.gov/cgi-bin/getpub?nsf9932

National Science Foundation. (1999b). Research on survey methodology (NSF Publication 99-35). Arlington, VA: Author. [Online]. Available: http://www.nsf.gov/cgi-bin/getpub?nsf9935

University Consortium for Geographic Information Science. (1996). Research priorities for geographic information science. Cartography and Geographic Information Systems, 23, 117.

*William Sims Bainbridge is the science adviser to the Directorate for Social, Behavioral, and Economic Sciences of the National Science Foundation. He holds a doctorate in sociology from Harvard University and specializes in the sociology of religion, technology, computer software, and innovative research methodologies. He may be contacted by e-mail at wbainbri@nsf.gov. His personal web site, the Question Factory (http://www.erols.com/bainbri/qf.htm), is a system for creating new questionnaire research modules and software.

The views expressed in this article do not necessarily represent the views of the National Science Foundation or the United States.