Entity Resolution On-Demand

Entity resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically run as an expensive offline cleaning step over the entire dataset before the data is consumed. This is a poor fit when time is limited or only part of the data is needed (e.g., when the data changes frequently or the user is only interested in a portion of the dataset for the task). BrewER is a framework designed to evaluate SQL SP (select-project) queries on dirty data while progressively returning results as if they were issued on cleaned data, according to a priority defined by the user. BrewER is implemented as an open-source Python library and can be seamlessly integrated with existing ER tools and algorithms.
Project carried out with the contributions of Giovanni Simonini, Sonia Bergamaschi, Felix Naumann
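
The on-demand idea can be illustrated with a toy sketch. This is not BrewER's actual algorithm or API: the records, the blocking key, the conflict-resolution rule (keep the minimum price), and every function name below are hypothetical. It mimics a query such as "WHERE price < 500 ORDER BY price ASC" by resolving one group of candidate duplicates at a time, in priority order, and emitting each resolved entity as soon as it is ready.

```python
import re
from collections import defaultdict

# Hypothetical dirty records: several rows may describe the same product.
RECORDS = [
    {"id": 1, "brand": "canon", "model": "eos 400d", "price": 185.0},
    {"id": 2, "brand": "Canon", "model": "EOS 400D", "price": None},
    {"id": 3, "brand": "nikon", "model": "d-200", "price": 150.0},
    {"id": 4, "brand": "Nikon", "model": "D200", "price": 129.0},
    {"id": 5, "brand": "sony", "model": "a7", "price": 900.0},
]


def blocking_key(rec):
    """Cheap normalized key; a stand-in for a real blocking/matching step."""
    return re.sub(r"[^a-z0-9]", "", (rec["brand"] + rec["model"]).lower())


def known_prices(block):
    return [r["price"] for r in block if r["price"] is not None]


def merge(block):
    """Resolve one block into one entity (assumed rule: keep the minimum price)."""
    prices = known_prices(block)
    return {
        "key": blocking_key(block[0]),
        "price": min(prices) if prices else None,
        "sources": [r["id"] for r in block],
    }


def query_on_demand(records, predicate):
    """Yield resolved entities that satisfy `predicate`, cheapest first."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    def priority(block):
        # With the minimum-price rule, this is exactly the merged price.
        prices = known_prices(block)
        return min(prices) if prices else float("inf")

    for block in sorted(blocks.values(), key=priority):
        entity = merge(block)  # resolution happens only when the block is reached
        if predicate(entity):
            yield entity


if __name__ == "__main__":
    cheap = lambda e: e["price"] is not None and e["price"] < 500
    for entity in query_on_demand(RECORDS, cheap):
        print(entity)
```

Because query_on_demand is a generator, a consumer that only needs the first few results can stop iterating early, and the remaining blocks are never merged.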

Determining the Largest Overlap between Tables

Both on the Web and in data lakes, a great deal of redundant data appears in the form of largely overlapping pairs of tables. Detecting the largest overlap between two tables, defined as their largest common subtable, supports relevant tasks such as discovering multiple coexisting versions of the same table (i.e., duplicate tables), which may differ in the completeness and correctness of the information they convey. Sloth is a framework designed to efficiently determine the largest overlap between tables. Detected duplicate tables can then be made consistent through data cleaning and change propagation, or eliminated to free up storage space and spare editors redundant work.
Project carried out with the contributions of Tobias Bleifuß, Giovanni Simonini, Sonia Bergamaschi, Felix Naumann - Logo created by Giacomo Pirani
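
To make the notion of a largest common subtable concrete, here is a brute-force sketch for very small tables. It is not Sloth's algorithm, which is designed to be efficient. It assumes that "largest" means maximum area (matched rows times matched columns) and that rows and columns may appear in any order. The function name and the example tables are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations


def largest_overlap(table_a, table_b):
    """Return (area, column_mapping, n_rows) of the largest common subtable.

    Exhaustive search over column mappings: only viable for tiny tables.
    """
    n_a, n_b = len(table_a[0]), len(table_b[0])
    best = (0, (), 0)
    for k in range(1, min(n_a, n_b) + 1):
        for cols_a in combinations(range(n_a), k):
            # Project table A once per column subset, as a multiset of rows.
            rows_a = Counter(tuple(row[c] for c in cols_a) for row in table_a)
            for cols_b in permutations(range(n_b), k):
                rows_b = Counter(tuple(row[c] for c in cols_b) for row in table_b)
                # Multiset intersection counts rows shared under this mapping.
                shared = sum((rows_a & rows_b).values())
                area = k * shared
                if area > best[0]:
                    best = (area, tuple(zip(cols_a, cols_b)), shared)
    return best


# Hypothetical example: same facts, swapped columns, one diverging value.
TABLE_A = [
    ("Italy", "Rome", 59),
    ("France", "Paris", 67),
    ("Spain", "Madrid", 47),
]
TABLE_B = [
    ("Rome", "Italy", 60),
    ("Paris", "France", 67),
    ("Berlin", "Germany", 83),
]

if __name__ == "__main__":
    print(largest_overlap(TABLE_A, TABLE_B))
```

On these two tables the best overlap pairs the country and city columns over two rows (area 4), which beats the single row that agrees on all three columns (area 3).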

Other projects

Evaluation of Dataframe Libraries for Data Preparation

Data preparation is a trial-and-error process that typically involves countless iterations over the data to define the best pipeline of operators for a given task. The dataframe is widely recognized as the foundational data structure for working with tabular data. While Pandas is the de facto standard library for manipulating dataframes, many alternatives have been created to overcome its limitations on large datasets. To support data scientists in choosing the library that best suits the data preparation task at hand, we extensively evaluate the most popular Python dataframe libraries in general data preparation use cases, using real-world datasets and pipelines with distinct characteristics to cover a range of scenarios.
Project carried out in collaboration with Angelo Mozzillo (main contributor), Adeel Aslam, Luca Gagliardelli, Giovanni Simonini, Sonia Bergamaschi

Digital Experience Platform

DXP manages the billing data of the users of different companies operating in several areas (electricity and gas, telephony, etc.). The goal of the platform is to acquire the billing data provided by these companies, then to process it in order to produce analytics for the companies (e.g., churn prediction or customer segmentation) and reports for their users (e.g., interactive billing).
Project carried out by the DBGroup @ UniMoRe, commissioned and supervised by Doxee, and funded by Regione Emilia-Romagna

Energy Community Data Platform

ECDP is a middleware platform designed to collect and analyze big data about energy consumption and production inside local energy communities, with the aim of encouraging users to use energy more consciously, at home and in the workplace. ECDP acquires data of different kinds, in heterogeneous formats, from multiple sources. Its modular architecture, designed to support both a data integration workflow and a data lake workflow, provides flexibility and scalability, making ECDP applicable to any type of local energy community.
Project carried out by the DBGroup @ UniMoRe in collaboration with DataRiver, supervised by ENEA, and funded by the Italian Ministry of Economic Development (MISE)