Current Projects

OYSTER – Open SYstem for Entity Resolution
Sponsor: ERIQ Research Center
Project Leader: John R. Talburt
An ongoing project, OYSTER has been the primary ER research platform for the ERIQ Center since 2009. Originally developed as a demonstration tool for teaching the principles of ER, it has grown into a useful and widely adopted open source system. First posted on SourceForge.net as project “OysterER”, it has since moved to BitBucket.net under the same designation, “OysterER.”

OFFER – An Open Framework for Entity Resolution
Sponsor: ERIQ Research Center
Project Leader: James True
OFFER is a project to re-architect the OYSTER open source entity resolution system into an open Java framework.

ER for Material MDM
Collaborator: PiLog International, South Africa
Project Leader: John R. Talburt (ERIQ), Salomon De Jager (PiLog)
A project to understand the special use cases for ER in materials and parts cataloging. Working with PiLog South Africa, the ERIQ Team is developing a number of adaptations of OYSTER to assess its effectiveness for use in non-party MDM and ER.

Positive Data Control
Sponsor: ERIQ Research Center
Project Leader: Yanbin Ye
A project to synchronize datasets and data catalog entries in the Big Data (HDFS) distributed processing environment. Initial use cases use the newly released Apache Atlas open source data catalog.

Application of Machine Learning to ER
Sponsor: ERIQ Research Center
Project Leader: Yumeng Ye
A project to explore the use of machine learning and other AI techniques for ER. Initial use cases explore the application of logistic regression to record linking.
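
As a rough illustration of logistic regression applied to record linking, the following sketch trains a tiny classifier on similarity features computed for pairs of records. It is not taken from the project: the field names (`name`, `address`), the sample records, the string similarity measure, and the training settings are all assumptions chosen for illustration.

```python
import math
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stdlib difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def features(rec_a, rec_b):
    """Turn a pair of records into a feature vector of field similarities.
    The fields 'name' and 'address' are hypothetical."""
    return [similarity(rec_a["name"], rec_b["name"]),
            similarity(rec_a["address"], rec_b["address"])]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights by plain stochastic gradient descent."""
    xs = [features(a, b) for a, b in pairs]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def match_probability(model, rec_a, rec_b):
    """Estimated probability that two records refer to the same entity."""
    weights, bias = model
    x = features(rec_a, rec_b)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Made-up labeled pairs: 1 = same entity, 0 = different entities.
r1 = {"name": "John Talbot", "address": "12 Oak St"}
r2 = {"name": "Jon Talbot", "address": "12 Oak Street"}
r3 = {"name": "Mary Smith", "address": "400 Main Ave"}
r4 = {"name": "Mary Smyth", "address": "400 Main Avenue"}
r5 = {"name": "Alice Jones", "address": "9 Pine Rd"}
r6 = {"name": "Robert Brown", "address": "77 Elm Blvd"}

model = train([(r1, r2), (r3, r4), (r1, r3), (r5, r6)], [1, 1, 0, 0])
p_match = match_probability(model, r1, r2)
p_nonmatch = match_probability(model, r1, r3)
```

A pair would typically be linked when its estimated probability exceeds a chosen threshold such as 0.5. A realistic study would use far more labeled pairs and a held-out evaluation set.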

Multi-Value Comparators for Unstructured and Semi-Structured Characteristic Data
Sponsor: ERIQ Research Center
Project Leader: Xinming Li
A project experimenting with techniques for comparing and linking data with unstructured or semi-structured organization, e.g. free-form name and address fields and short-text social media.
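
One simple way to compare free-form fields is to treat each field as a set of tokens and measure the overlap, so that word order and punctuation do not matter. The sketch below uses Jaccard similarity for this. It is only an assumed example of the general idea, not a description of the comparators the project develops, and the function names are hypothetical.

```python
import re


def tokens(text):
    """Split free-form text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two free-form strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # two empty fields carry no evidence of a match
    return len(ta & tb) / len(ta | tb)


# Same address tokens in a different order and with different punctuation.
same_address = jaccard("123 Main St, Little Rock AR",
                       "Little Rock, AR 123 Main St")
# Name written with and without a middle initial.
name_score = jaccard("John R. Talburt", "Talburt, John")
```

Here the two address strings share every token, while the two name strings share two of three distinct tokens.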

Past Projects

Comparing the Effectiveness of Deterministic Matching with Probabilistic Matching for Student Enrollment ER
Sponsor: Arkansas Department of Education (ADE)
Principal Investigator:  John R. Talburt
June 2009 – March 2014

Clinical and Translation Science Initiative
Sponsor: National Institutes of Health (NIH) through the University of Arkansas for Medical Sciences (UAMS)
Principal Investigator:  John R. Talburt
June 2009 – March 2014
UAMS has undertaken an initiative to enhance the quality and integration of research data, to provide tools for collaboration, and to develop information quality training for researchers, staff, and students. As part of this, UALR and UAMS have agreed on three research and development efforts that support the overall initiative. The first is enhancing the quality of the research data warehouse. The second is providing tools for collaboration and community engagement to support translational research, which includes investigating collaboration and community communication tools such as wikis, web logs, discussion forums, groupware, screen sharing, and content management. The third is the development and implementation of Information Quality Training Classes. The training will focus on three groups: students enrolled in Bioinformatics graduate programs, students enrolled in graduate programs related to translational research, and faculty and staff involved in translational research.

Referent Tracking in Health Care
Sponsor: University of Arkansas for Medical Sciences (UAMS)
Principal Investigator:  John Talburt
July 2012 – June 2013
A collaborative project intended to explore the integration of Referent Tracking (RT) software developed by faculty and staff at UAMS and the Open sYSTem for Entity Resolution (OYSTER) software developed by faculty, staff, and students at the University of Arkansas at Little Rock. Referent Tracking seeks to assign an instance unique identifier (IUI) to each individual person, process, disease, prescription, fracture, tumor, etc. Assigning more than one IUI to the same entity is therefore highly problematic, as is assigning one IUI to two different individual entities. OYSTER is software that can detect such errors in IUI assignment. Furthermore, OYSTER has a mode of operation that relies on tracking unique individual persons and addresses over time. It can thus be extended through Referent Tracking to track and de-duplicate IUIs for diseases, prescriptions, tumors, fractures, and other types of entities.
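
The two kinds of IUI assignment error described above can be sketched as a simple consistency check. This is a minimal illustration, not OYSTER's or Referent Tracking's actual interface: it assumes each record already carries an IUI and an entity identifier produced by some entity resolution step, and the function name and sample identifiers are hypothetical.

```python
from collections import defaultdict


def find_iui_conflicts(assignments):
    """Given (iui, entity_id) pairs, report both kinds of assignment error.

    Returns a pair of dicts:
      duplicated: entity_id -> sorted IUIs, for entities with more than one IUI
      overloaded: iui -> sorted entity_ids, for IUIs shared by several entities
    """
    iuis_by_entity = defaultdict(set)
    entities_by_iui = defaultdict(set)
    for iui, entity_id in assignments:
        iuis_by_entity[entity_id].add(iui)
        entities_by_iui[iui].add(entity_id)
    duplicated = {e: sorted(i) for e, i in iuis_by_entity.items() if len(i) > 1}
    overloaded = {i: sorted(e) for i, e in entities_by_iui.items() if len(e) > 1}
    return duplicated, overloaded


# Made-up sample: entity E1 received two IUIs, and IUI-3 covers two entities.
sample = [("IUI-1", "E1"), ("IUI-2", "E1"),
          ("IUI-3", "E2"), ("IUI-3", "E3"),
          ("IUI-4", "E4")]
duplicated, overloaded = find_iui_conflicts(sample)
```

In this sample, `duplicated` flags E1 and `overloaded` flags IUI-3, while the consistent assignment of IUI-4 to E4 is not reported.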

Towards High-Quality of Identity Attributes
Sponsor: Arkansas Department of Education (ADE)
Principal Investigator:  Ningning Wu, Co-Principal Investigator: John R. Talburt
July 2012 – June 2013
The primary goal of this research is to investigate how the quality of identity attributes affects the quality of entity resolution of Arkansas K-12 student records. It will first study the data quality of identity attributes to identify the key quality problems, then evaluate how the quality of the identity attributes relates to the quality of entity resolution (ER) results with respect to false negatives. It will attempt to rank identity attributes according to their impact on ER results as a means to prioritize data quality improvement efforts. The results of this study will enable ADE to communicate more efficiently with schools, raise their awareness of data quality, and help them improve data quality in their data collection and submission processes. The results will also provide guidance for refining the Cycle Validation/Data Accuracy Reports and the certification process of Statewide Information Systems (SIS) so that they can be more effective in enforcing data quality standards across all schools.

Information Quality Tools for Persistent Surveillance Data Sets
Sponsor: US Air Force Research Laboratory, Sensor Directorate, Wright-Patterson Air Force Base, Dayton, Ohio
Principal Investigator:  John Talburt; Co-Principal Investigators: Serhan Dagtas, Mariofanna Milanova, Mihail E. Tudoreanu
June 2009 – October 2011
The Air Force desires a comprehensive vehicle to identify and address requirements for information quality tools and techniques that will support defensive and offensive operations research in the layered sensing domain. As use of remote sensors in the Air and Space domains increases, the value of the sensor datasets must be maximized and assurances established that the product outcomes meet the application requirements. As multiple sensors are combined into layered sensing systems, it becomes more important to understand not only the quality and fitness for use of the individual sensor data streams, but also how to assess the quality and value of the aggregate data. The scope of this task order is to develop metrics that assess the quality and effectiveness of persistent surveillance data sets. The project also explores the use of 3-D visualization in rendering layered data sets and experiments with the integration of textual information. In addition, integrated processing of data available from multiple types of sensors (such as in a Smart Environment) has been explored, and experiments have been done to support data fusion for multiple sensors.

Proof-of-Concept for an Open-System Entity Resolution Engine to Support Longitudinal Studies in Education
Sponsor: Arkansas Department of Education
Principal Investigator: John Talburt; Co-Principal Investigator: Ningning Wu
June 2009 – May 2010
The TRUSTed Project is a research project to investigate the design for an open-system, entity resolution engine for the Arkansas Department of Education that will support longitudinal (multi-year) studies of student performance in Arkansas schools.  A prototype of the design will be implemented using open system tools and standards.

Delta Center for Identity Solutions
A collaboration with Arkansas State University Center for the Study of Automatic Identification
Sponsor: Arkansas Science and Technology Authority
Investigators: John Talburt, Farhad Moeeni (Arkansas State University), Dale Thompson (University of Arkansas, Fayetteville)
Research related to both identity management and identity information management. Identity management refers to the accurate, real-time, secure and tamper-proof identification and authorization of people for physical access to facilities, logical access to computer networks and for performance of various transactions. Identity information management entails measuring and improving information quality, detecting fraudulent behavior, and understanding data privacy and protection issues.

A Semiotic Approach to Layout Inference and Data Transformation
Sponsor: Acxiom Laboratory for Applied Research
Investigators: John Talburt, Ningning Wu, Chia-Chu Chiang
An investigation into the application of the principles of semiotics (intention, syntax, and semantics) to the problem of locating and identifying information in datasets of unknown layout through syntactic and semantic profiling. It also envisions reusing the profiling information to develop a declarative language for describing and implementing data transformations.

Methods and Techniques for Entity Identification in Open Source Documents with Partially Redacted Attributes
Sponsor: National Science Foundation
Investigators: John Talburt, Chia-Chu Chiang, Ningning Wu, Richard Wang (MIT)
An investigation into the degree to which partial identity information (“identity fragments”) can be resolved within a large reference set of known identities based on a variety of collateral evidence including quality of match, age consistency, and family member co-location.