At Delft, we have been implementing schema matching algorithms for dataset discovery for more than a year. The result of our work is a collection of algorithms and curated datasets that constitute the Valentine project.

The algorithms included are:
 1. Coma: Python wrapper around COMA 3.0 Community edition
 2. Cupid: Contains the Python implementation of the paper "Generic Schema Matching with Cupid" (VLDB 2001)
 3. Distribution-based: Contains the Python implementation of the paper "Automatic Discovery of Attributes in Relational Databases" (SIGMOD 2011)
 4. EmbDI: Contains the EmbDI code provided by the authors in their GitLab repository, accompanying the paper "Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks" (SIGMOD 2020)
 5. Jaccard Levenshtein: Contains our own baseline, which uses Jaccard similarity between columns to compute their correspondence score, enhanced by Levenshtein distance
 6. SemProp: Contains the code of Seeping Semantics provided in the paper "Aurum: A Data Discovery System" (ICDE 2018).
 7. Similarity Flooding: Contains the Python implementation of the paper "Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching" (ICDE 2002).
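
To give a feel for the idea behind the Jaccard/Levenshtein baseline (item 5), here is a minimal sketch. It is not the Valentine implementation: the function names, the 0.8 threshold, and the rule that two values "match" when their normalized Levenshtein similarity reaches that threshold are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def values_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical rule: values match if normalized similarity >= threshold."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return 1 - levenshtein(a, b) / longest >= threshold


def fuzzy_jaccard(col_a, col_b, threshold: float = 0.8) -> float:
    """Jaccard similarity of two columns where membership is fuzzy, not exact."""
    set_a, set_b = set(col_a), set(col_b)
    if not set_a or not set_b:
        return 0.0
    matched_a = sum(any(values_match(a, b, threshold) for b in set_b) for a in set_a)
    matched_b = sum(any(values_match(b, a, threshold) for a in set_a) for b in set_b)
    # Take the smaller count so the score always stays within [0, 1].
    intersection = min(matched_a, matched_b)
    union = len(set_a) + len(set_b) - intersection
    return intersection / union


if __name__ == "__main__":
    cities_a = ["Amsterdam", "Delft", "Rotterdam"]
    cities_b = ["amsterdam", "Delft", "Utrecht"]
    print(fuzzy_jaccard(cities_a, cities_b))
```

With exact Jaccard, "Amsterdam" and "amsterdam" would count as different values; the edit-distance tolerance lets near-duplicates contribute to the overlap, which is the intuition behind combining the two measures.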

We ran numerous experiments evaluating these algorithms against many datasets and 550 column pairs, for which we curated and now offer ground-truth matches (see project page).

The results of this work are described in our preprint:
"Valentine: Evaluating Matching Techniques for Dataset Discovery": Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, Asterios Katsifodimos.

We invite the community to use the algorithms and the datasets in their papers, to submit bug reports and feature requests, and to give us a star on GitHub.

We are working hard to make all of these as usable as possible, and we look forward to seeing them used!

Asterios Katsifodimos