Internship proposal

Subject: Interactive Generation of Statistically Realistic Datasets
Level: Master's student or post-doc
Location: Paris, France
Expected duration: 6 months (starting around March would be ideal; some flexibility is possible)
Expected background: Information Visualization, HCI; programming skills needed for developing an interactive prototype. A surface knowledge of Bayesian statistics will be useful.

Subject description:

The datasets for which knowledge discovery (i.e. business analytics) solutions bring the highest return on investment and usefulness are of a highly sensitive nature, for both privacy and business reasons. Access to this data must be highly restricted, if only as a precaution. Consequently, the lack of a feedback loop between software evolution and actual usage patterns makes it extremely difficult to create appropriate information management solutions to extract the best value from these datasets. This conundrum affects not just the research community but, more generally, all professionals in charge of creating software solutions to manage these highly confidential assets.

Two potential research directions emerge to address this issue: anonymization of existing datasets, and the generation of statistically realistic datasets. In light of the debacle that followed the release of the AOL search data, we discard the first and propose to work on the second. We propose an internship aimed at building a generator of highly realistic, statistically sound, but entirely artificial data. The generator would take as input a data model (UML, a set of classes, a database schema?), constraints on this data model, and statistics on the conditional distributions of variable values in this model. It would then generate a population of random records satisfying those constraints and distributions.
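To make the intended input/output concrete, here is a minimal sketch of such a generator. The schema, distributions, and constraint below (an age field, an income distribution conditioned on age bracket, and a rejection predicate) are invented for illustration only; the actual prototype would read these from a user-supplied data model.

```python
import random

# Hypothetical toy model: each record has an age and an income.
# Income is sampled conditionally on the age bracket, and records
# violating any user-supplied constraint are rejected and resampled.

AGE_DIST = [(18, 30, 0.40), (31, 50, 0.35), (51, 80, 0.25)]  # (lo, hi, prob)

INCOME_BY_AGE = {  # (mean, stddev) of income, conditional on age bracket
    (18, 30): (25000, 8000),
    (31, 50): (45000, 15000),
    (51, 80): (40000, 20000),
}

def sample_age(rng):
    """Draw an age from the bracketed categorical distribution."""
    r, acc = rng.random(), 0.0
    for lo, hi, p in AGE_DIST:
        acc += p
        if r <= acc:
            return rng.randint(lo, hi), (lo, hi)
    lo, hi, _ = AGE_DIST[-1]          # guard against rounding at acc ~ 1.0
    return rng.randint(lo, hi), (lo, hi)

def generate(n, constraints, seed=0):
    """Rejection-sample n records satisfying every constraint."""
    rng = random.Random(seed)
    records = []
    while len(records) < n:
        age, bracket = sample_age(rng)
        mu, sigma = INCOME_BY_AGE[bracket]
        record = {"age": age, "income": max(0.0, rng.gauss(mu, sigma))}
        if all(c(record) for c in constraints):
            records.append(record)
    return records

# Example constraint: records under 21 cannot report an income above 20k.
data = generate(100, [lambda r: r["age"] > 20 or r["income"] < 20000])
```

Rejection sampling is the simplest way to combine distributions with arbitrary constraints; a real prototype would need smarter strategies when constraints are tight, which is precisely where interactive feedback on the generation model becomes valuable.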
From our current experience in this area and from related work, we see that the challenge behind this problem lies in providing ways to assess the ecological validity of an intermediate solution, and then offering interactive means to refine the generation model with additional rules or statistics to improve its quality. Hence, the internship focuses principally on providing the right interface and tools to support this iterative design process, rather than on the statistical aspects of realistic data generation, which are comparatively simple.

Related bibliography:
- Baudel, T. From information visualization to direct manipulation: extending a generic visualization framework for the interactive editing of large datasets. In Proc. ACM UIST '06, pages 67-76, 2006.
- Whiting, M. A., Haack, J., and Varley, C. Creating realistic, scenario-based synthetic data for test and evaluation of information analytics software. In Proc. BELIV '08, April 5, 2008, Florence, Italy. ACM Press.
- Barbosa, D., Mendelzon, A., Keenleyside, J., and Lyons, K. ToXgene: a template-based data generator for XML. In Fifth International Workshop on the Web and Databases (2002).
- Gray, J., Sundaresan, P., Englert, S., Baclawski, K., and Weinberger, P. J. Quickly generating billion-record synthetic databases. In Proc. SIGMOD, ACM Press (1994), 243-252.
- Theodoridis, Y., Silva, J., and Nascimento, M. On the generation of spatiotemporal datasets. In Proc. Symp. Large Spatial Databases (SSD) (1999), 147-164.
- GS Data Generator. http://www.GSApps.com.
- Additional internal work.