Internship proposal

Subject: Interactive Generation of Statistically Realistic Datasets
Level: Master's student or post-doc
Location: Paris, France
Expected duration: 6 months (starting around March would be ideal; some flexibility is possible)
Expected background: Information Visualization, HCI; programming skills needed for developing an interactive prototype. A surface knowledge of Bayesian statistics will be useful.

Subject description:

The datasets for which knowledge discovery (i.e. business analytics) solutions bring the highest return on investment and usefulness are of a highly sensitive nature, for both privacy and business reasons. Access to this data must be highly restricted, if only as a precaution. Consequently, the lack of a feedback loop between software evolution and actual usage patterns makes it extremely difficult to create appropriate information management solutions to extract the best value from these datasets. This conundrum affects not just the research community but, more generally, all professionals in charge of creating software solutions to manage these highly confidential assets.

Two potential research directions emerge to address this issue: anonymization of existing datasets, and the generation of statistically realistic datasets. In light of the debacle that followed the release of the AOL search data, we discard the first and propose to work on the second. We propose an internship aimed at building a generator of highly realistic, statistically sound, but entirely artificial data. The generator would take as input a data model (UML, a set of classes, a database schema?), constraints on this data model, and statistics on the conditional distributions of variable values in this model. It would then generate a population of random records satisfying those constraints and distributions.
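To make the intended input/output concrete, here is a minimal sketch of such a generator. The schema, distributions, and constraint below (an age field, an income distribution conditioned on age bracket, and a rejection predicate) are invented for illustration only; the actual prototype would read these from a user-supplied data model.

```python
import random

# Hypothetical toy model: each record has an age and an income.
# Income is sampled conditionally on the age bracket, and records
# violating any user-supplied constraint are rejected and resampled.

AGE_DIST = [(18, 30, 0.40), (31, 50, 0.35), (51, 80, 0.25)]  # (lo, hi, prob)

INCOME_BY_AGE = {  # (mean, stddev) of income, conditional on age bracket
    (18, 30): (25000, 8000),
    (31, 50): (45000, 15000),
    (51, 80): (40000, 20000),
}

def sample_age(rng):
    """Draw an age from the bracketed categorical distribution."""
    r, acc = rng.random(), 0.0
    for lo, hi, p in AGE_DIST:
        acc += p
        if r <= acc:
            return rng.randint(lo, hi), (lo, hi)
    lo, hi, _ = AGE_DIST[-1]          # guard against rounding at acc ~ 1.0
    return rng.randint(lo, hi), (lo, hi)

def generate(n, constraints, seed=0):
    """Rejection-sample n records satisfying every constraint."""
    rng = random.Random(seed)
    records = []
    while len(records) < n:
        age, bracket = sample_age(rng)
        mu, sigma = INCOME_BY_AGE[bracket]
        record = {"age": age, "income": max(0.0, rng.gauss(mu, sigma))}
        if all(c(record) for c in constraints):
            records.append(record)
    return records

# Example constraint: records under 21 cannot report an income above 20k.
data = generate(100, [lambda r: r["age"] > 20 or r["income"] < 20000])
```

Rejection sampling is the simplest way to combine distributions with arbitrary constraints; a real prototype would need smarter strategies when constraints are tight, which is precisely where interactive feedback on the generation model becomes valuable.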
From our current experience in this area and from related work, we see that the challenge behind this problem lies in providing ways to assess the ecological validity of an intermediate solution, and then offering interactive means to refine the generation model with additional rules or statistics to improve its quality. Hence, the internship focuses principally on providing the right interface and tools to support this iterative design process, rather than on the statistical aspects of realistic data generation, which are comparatively simple.

Related bibliography:
- Baudel, T. From information visualization to direct manipulation: extending a generic visualization framework for the interactive editing of large datasets. In Proc. ACM UIST '06, pages 67-76, 2006.
- Whiting, M. A., Haack, J., and Varley, C. Creating realistic, scenario-based synthetic data for test and evaluation of information analytics software. In Proc. BELIV '08, April 5, 2008, Florence, Italy. ACM Press.
- Barbosa, D., Mendelzon, A., Keenleyside, J., and Lyons, K. ToXgene: a template-based data generator for XML. In Fifth International Workshop on the Web and Databases (2002).
- Gray, J., Sundaresan, P., Englert, S., Baclawski, K., and Weinberger, P. J. Quickly generating billion-record synthetic databases. In Proc. SIGMOD, ACM Press (1994), 243-252.
- Theodoridis, Y., Silva, J., and Nascimento, M. On the generation of spatiotemporal datasets. In Proc. Symp. Large Spatial Databases (SSD) (1999), 147-164.
- GS Data Generator. http://www.GSApps.com.
- Additional internal work.