Within the ANR AI Chair project SourcesSay, we are looking to hire a Ph.D. student with strong demonstrated abilities in machine learning and data management. The Ph.D. student should start in January 2021 (or as soon as possible thereafter) for a duration of 3 years, to perform research in machine learning and data science applied to journalistic investigations. The Ph.D. student will develop novel question answering methods to help journalists explore large heterogeneous datasets.

    SourcesSay is a collaboration between Inria Saclay, the French national newspaper Le Monde, and the SME WeDoData.

    The Ph.D. student will work at Inria Saclay, which is the main partner of the ANR project.

    The student will be supervised by Oana Balalau and Ioana Manolescu. To apply, please send an email to Oana Balalau with your CV, grades, and, if possible, a letter of recommendation.

    In the SourcesSay project, we are interested in finding useful information in large datasets to support investigative journalism. Real-world events such as elections, public demonstrations, and disclosures of illegal or surprising activities are mirrored in new data items that are created and added to the global corpus of available information. Making sense of this wealth of data through a question answering (QA) framework will facilitate the work of journalists.
    In our framework, we ingest any type of dataset (text, CSV, JSON, XML, RDF, PDF, and relational databases) and organize the information as a heterogeneous graph, where nodes represent important pieces of information (such as entities in text, nodes in RDF, non-null attributes in tuples in relational data, and so on) and edges represent a structural or semantic connection between the nodes. Our first goal is to take advantage of any existing structure to best answer questions over these datasets. This will entail trying different representations of the data, assigning types to nodes, and normalizing edge labels between nodes. Our second goal is to answer a wide variety of question types with high confidence. The requirement for high confidence comes from the importance of the application, as journalists will use our framework to improve or speed up their work. Finally, our third goal relates to the need for transparency of sources under which journalists work. Any proposed QA system should be interpretable and open to scrutiny. Therefore, in addition to the answer to a question, the user will also have access to an explanation of why that answer was chosen.
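
To make the graph representation concrete, here is a minimal sketch in Python of how a relational tuple and a JSON document might be mapped to a single heterogeneous graph. It is an illustration only, not the project's actual implementation: the `HeteroGraph` class, the `ingest_tuple` and `ingest_json` helpers, the node kinds, and the sample data are all hypothetical, and the extra node created for each tuple is an assumption made here to keep attributes of the same tuple connected.

```python
from collections import defaultdict


class HeteroGraph:
    """Toy heterogeneous graph: typed nodes, labeled undirected edges (hypothetical)."""

    def __init__(self):
        self.nodes = {}                # node id -> {"label": ..., "kind": ...}
        self.edges = defaultdict(set)  # node id -> set of (edge label, neighbor id)

    def add_node(self, label, kind):
        node_id = len(self.nodes)
        self.nodes[node_id] = {"label": label, "kind": kind}
        return node_id

    def add_edge(self, src, dst, label):
        # Store the edge in both directions so it can be traversed either way.
        self.edges[src].add((label, dst))
        self.edges[dst].add((label, src))


def ingest_tuple(graph, table, row):
    """Create one node per non-null attribute value, linked to a node for the tuple.

    The tuple node is an assumption of this sketch, not something the text specifies.
    """
    tuple_id = graph.add_node(table, "tuple")
    for attribute, value in row.items():
        if value is not None:  # null attributes produce no node
            value_id = graph.add_node(str(value), "value")
            graph.add_edge(tuple_id, value_id, attribute)
    return tuple_id


def ingest_json(graph, name, value):
    """Recursively map a parsed JSON value to nodes; keys become edge labels."""
    if isinstance(value, dict):
        node_id = graph.add_node(name, "json_object")
        for key, child in value.items():
            graph.add_edge(node_id, ingest_json(graph, key, child), key)
        return node_id
    if isinstance(value, list):
        node_id = graph.add_node(name, "json_array")
        for child in value:
            graph.add_edge(node_id, ingest_json(graph, name, child), "item")
        return node_id
    return graph.add_node(str(value), "value")


# Made-up sample data, for illustration only.
graph = HeteroGraph()
ingest_tuple(graph, "donation", {"donor": "ACME", "amount": 5000, "note": None})
ingest_json(graph, "company", {"name": "ACME", "board": ["A. Martin"]})
```

In this sketch the two sources stay disconnected even though both mention "ACME". Merging such equivalent nodes, assigning types to them, and normalizing edge labels are exactly the kinds of representation choices the first goal above refers to.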

    The candidate should be proficient in written and spoken English. She or he should be motivated to do research and have a good academic record.