Extraction of named entities and uncertainty in prosopographic databases

** Scientific context **

The DAPHNE ANR project aims at formalizing the knowledge used by historians for prosopographic
analyses (such as the careers of historical figures), and at specifying the conditions under which parts of
hypothesis development and validation may be automated. The project brings together academic
partners specializing in pattern mining, database querying and data quality in information systems, as well
as historians studying academics from the 13th to the 15th centuries.
The Studium Parisiense project (http://studium.univ-paris1.fr/) contains more than 17,000 records describing the careers of academic
figures. These records are based on factoids, i.e., proto-facts mentioned by sources. When a factoid
reaches a certain degree of reliability (for instance, because several sources support it), it is called a
fact. These data were recorded over a long period, with formatting rules adapted to historical work.
The first steps of the project focused on elaborating a data model to capture the main concepts
from prosopographic work, and a set of rules to rate the reliability of sources and the credibility of
information [2].
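
As an illustration only, the factoid/fact distinction described above could be sketched as follows. This is a minimal sketch, not the project's data model: the class name `Factoid`, the method names, the source identifiers and the threshold of two distinct sources are all assumptions, and the actual rules in [2] rate the reliability of sources and the credibility of information rather than simply counting sources.

```python
from dataclasses import dataclass, field


@dataclass
class Factoid:
    """A proto-fact mentioned by one or more sources (hypothetical structure)."""
    statement: str
    sources: set[str] = field(default_factory=set)

    def add_source(self, source: str) -> None:
        # A set keeps each source only once, so repeated mentions
        # by the same source do not inflate the support.
        self.sources.add(source)

    def is_fact(self, min_sources: int = 2) -> bool:
        # Assumed rule: a factoid is promoted to a fact once enough
        # distinct sources support it. The threshold is illustrative.
        return len(self.sources) >= min_sources


factoid = Factoid("a given student studied theology in Paris")
factoid.add_source("source_A")
print(factoid.is_fact())  # supported by a single source so far
factoid.add_source("source_B")
print(factoid.is_fact())  # now supported by two distinct sources
```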

** Project description **

The ISID team (http://cedric.cnam.fr/lab/equipes/isid/) has strong skills in modeling and data management and has already proposed a
data model incorporating uncertainty for prosopographic databases. Initially, the postdoctoral work will
consist of finalizing the modeling proposed in [1, 3], in order to propose a method for representing and
managing data that is adapted to the production and validation of historical knowledge and that integrates
the notions of uncertainty and of the credibility of sources and authors. The objective will then be to propose a
named entity recognition (NER) method adapted to historical prosopographic data. The difficulty
here is twofold. First, these data are by nature often imprecise or uncertain (end
of the 13th century, Easter 1243, summer 1312, in Piedmont, near Paris, south of the Duchy of
Burgundy), incomplete (we do not know what a given university student did between two dates, or
at which university he studied theology), and even inconsistent or unreliable. Second,
most facts and places are temporally contextualized: for example, Flanders was independent
in the 11th century but Burgundian in the 14th century, and the territories of bishoprics and
feudal authorities changed over time and were often made up of non-contiguous zones. These two aspects make it
difficult to identify named entities in these data using classic tools such as CoreNLP, spaCy
or NLTK.
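
To make the imprecision issue concrete, here is a minimal sketch of how an expression such as "end of the 13th century" might be mapped to an interval of plausible years. It is not the method to be developed in the project: the `YearInterval` type, the `parse_century` function and the convention that "beginning", "middle" and "end" each cover one third of a century are assumptions made for illustration. Expressions such as "Easter 1243" or "near Paris" are deliberately left unhandled.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class YearInterval:
    start: int  # earliest plausible year, inclusive
    end: int    # latest plausible year, inclusive


# Matches e.g. "13th century" or "end of the 13th century".
_CENTURY = re.compile(
    r"(?:(beginning|middle|end) of the )?(\d{1,2})(?:st|nd|rd|th) century",
    re.IGNORECASE,
)


def parse_century(expression: str) -> Optional[YearInterval]:
    """Return the interval of years denoted by a century expression, if any."""
    match = _CENTURY.search(expression)
    if match is None:
        return None  # not a century expression; left to other rules
    qualifier, century = match.group(1), int(match.group(2))
    first = (century - 1) * 100 + 1  # 13th century -> 1201
    last = century * 100             # 13th century -> 1300
    if qualifier is None:
        return YearInterval(first, last)
    # Assumed convention: split the century into three parts.
    third = {"beginning": 0, "middle": 1, "end": 2}[qualifier.lower()]
    bounds = [first, first + 33, first + 66, last + 1]
    return YearInterval(bounds[third], bounds[third + 1] - 1)


print(parse_century("end of the 13th century"))
print(parse_century("near Paris"))
```

Representing a date as an interval rather than a single value keeps the imprecision explicit, so that later steps (for example, quantifying uncertainty) can reason about it instead of silently discarding it.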

Finally, the last objective of this postdoctoral position will consist of extracting and quantifying the
uncertainty attached to the various data, including the information about the entities produced
by the recognition method proposed in the first part of the work. Depending on progress, this approach may
be supplemented by the development of inference
rules to evaluate the uncertainty of data related to those for which an uncertainty has been
extracted. This uncertainty can then be used during subsequent phases of the DAPHNE project, in
particular for querying or extracting hypotheses.

** Candidate background **

• PhD in computer science
• good knowledge of knowledge bases (ontologies), NLP and NER
• proficiency in programming (Java, Python)
Feel free to apply even if you do not have precisely all the mentioned skills.

** Workplace, dates, salary **

• The candidate will join the ISID team (CÉDRIC lab at CNAM Paris), under the
supervision of Cédric du Mouza, Jacky Akoka and Isabelle Wattiau. Offices are located in the
center of Paris. Due to the ongoing pandemic, the beginning of the contract (at least) may
be remote.
• One-year contract, with a starting date in September 2021, if possible.
• The net salary will be around 2,500 euros per month.

** Contact **

Please send your application with a résumé to the following addresses:
dumouza@cnam.fr, jacky.akoka@lecnam.net and isabelle.wattiau@essec.edu.

** References **

[1] J. Akoka, I. Comyn-Wattiau, S. Lamassé, and C. du Mouza. Modeling historical social networks databases.
In T. Bui, editor, 52nd Hawaii International Conference on System Sciences, HICSS 2019, Grand
Wailea, Maui, Hawaii, USA, January 8-11, 2019, pages 1–10, 2019.
[2] J. Akoka, I. Comyn-Wattiau, S. Lamassé, and C. du Mouza. Contribution of conceptual modeling to enhancing
historians’ intuition - application to prosopography. In Conceptual Modeling - 39th International
Conference, ER, 2020.
[3] J. Akoka, I. Comyn-Wattiau, S. Lamassé, and C. du Mouza. Conceptual modeling of prosopographic
databases integrating quality dimensions. Journal of Data Mining and Digital Humanities, 2021.