************************************************************************
Extraction of named entities and uncertainty in prosopographic databases
************************************************************************

** Scientific context **

The DAPHNE ANR project aims to formalize the knowledge historians use for prosopographic analyses (such as the careers of historical figures) and to specify the conditions under which parts of hypothesis development and validation can be automated. The project closely associates academic partners specializing in pattern mining, database querying and data quality in information systems with historians studying academics from the 13th to the 15th centuries.

The Studium Parisiense project (http://studium.univ-paris1.fr/) contains more than 17,000 records describing the careers of academic figures. These records are based on factoids, i.e., proto-facts mentioned by sources. When a factoid reaches a certain degree of reliability (for instance, because several sources support it), it is called a fact. The data were recorded over a long period, with formatting rules adapted to historical work. The first steps of the project focused on elaborating a data model that captures the main concepts of prosopographic work, and a set of rules for rating the reliability of sources and the credibility of information [2].

** Project description **

The ISID team (http://cedric.cnam.fr/lab/equipes/isid/) has strong skills in modeling and data management and has already proposed a data model incorporating uncertainty for prosopographic databases. The postdoctoral work will begin by finalizing the model proposed in [1, 3], in order to propose a method for representing and managing data that is suited to the production and validation of historical knowledge and that integrates uncertainty and the credibility of sources and authors.
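To make the factoid/fact distinction concrete, here is a minimal Python sketch of how a factoid supported by several sources might be promoted to a fact. It is an illustration only, not the model of [1, 3]: the class names, the reliability scores in [0, 1], the noisy-OR combination rule and the 0.8 threshold are all assumptions, and the example statement and sources are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Source:
    """A historical source with a hypothetical reliability score in [0, 1]."""
    name: str
    reliability: float


@dataclass
class Factoid:
    """A proto-fact mentioned by zero or more sources."""
    statement: str
    sources: List[Source] = field(default_factory=list)

    def credibility(self) -> float:
        """Combine source reliabilities with a noisy-OR rule (an assumption).

        The result is the probability that at least one source is right,
        treating sources as independent.
        """
        disbelief = 1.0
        for source in self.sources:
            disbelief *= 1.0 - source.reliability
        return 1.0 - disbelief

    def is_fact(self, threshold: float = 0.8) -> bool:
        """A factoid counts as a fact once its credibility reaches the threshold."""
        return self.credibility() >= threshold


# Made-up example: one source is not enough, two corroborating sources are.
factoid = Factoid("X studied theology in Paris", [Source("source A", 0.6)])
print(factoid.is_fact())
factoid.sources.append(Source("source B", 0.7))
print(factoid.is_fact())
```

With these assumed scores, the factoid stays below the threshold with a single source and crosses it once the second source is added. A real rating scheme would follow the reliability and credibility rules elaborated in the project.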
The next objective will be to propose a named entity recognition (NER) method adapted to historical prosopographic data. The difficulty here is twofold. First, these data are by nature often imprecise or uncertain (end of the 13th century, Easter 1243, summer 1312, in Piedmont, near Paris, south of the Duchy of Burgundy), incomplete (we do not know what a given university student did between two dates, or at which university he studied theology), and sometimes inconsistent or unreliable. Second, most facts and places are temporally contextualized: for example, Flanders was independent in the 11th century but Burgundian in the 14th century, and the same holds for the territories of bishoprics and feudal authorities, which moreover were often formed of non-contiguous zones. These two aspects make it difficult to identify named entities in such data with classic tools such as CoreNLP, spaCy or NLTK.

Finally, the last objective of the postdoctoral work will be to extract and quantify the uncertainty attached to the various data, including the information about the entities produced by the recognition method proposed in the first part of the work. Depending on progress, this approach may be supplemented by inference rules that estimate the uncertainty of data related to those for which an uncertainty has been extracted. This uncertainty can then be used in subsequent phases of the DAPHNE project, in particular for questioning or extracting hypotheses.

** Candidate background **

• PhD in computer science
• good command of knowledge bases (ontologies, NLP, NER)
• strong programming skills (Java, Python)

Feel free to apply even if you do not have all of the skills mentioned.
** Workplace, dates, salary **

• The candidate will join the ISID team (CÉDRIC lab at CNAM Paris), under the supervision of Cédric du Mouza, Jacky Akoka and Isabelle Wattiau. Offices are located in the center of Paris. Due to the ongoing pandemic, at least the beginning of the contract may be remote.
• One-year contract, with a starting date in September 2021 if possible.
• The net salary will be around 2,500 euros per month.

** Contact **

Please send your application with a résumé to the following addresses: dumouza@cnam.fr, jacky.akoka@lecnam.net and isabelle.wattiau@essec.edu.

** References **

[1] J. Akoka, I. Comyn-Wattiau, S. Lamassé, and C. du Mouza. Modeling historical social networks databases. In T. Bui, editor, 52nd Hawaii International Conference on System Sciences, HICSS 2019, Grand Wailea, Maui, Hawaii, USA, January 8-11, 2019, pages 1–10, 2019.
[2] J. Akoka, I. Comyn-Wattiau, S. Lamassé, and C. du Mouza. Contribution of conceptual modeling to enhancing historians’ intuition - application to prosopography. In Conceptual Modeling - 39th International Conference, ER, 2020.
[3] J. Akoka, I. Comyn-Wattiau, S. Lamassé, and C. du Mouza. Conceptual modeling of prosopographic databases integrating quality dimensions. Journal of Data Mining and Digital Humanities, 2021.