Discovering existing datasets relevant to a scholar’s research is a daunting task, made more complicated by a lack of standards in descriptive practices, the size of datasets, and the scale and interdisciplinarity of many research topics. Universities and research institutions cannot store all possible datasets for their researchers, and instead focus on data management for their institutionally produced data. This does not meet the needs of research discovery, however, as internal researchers likely already know the others in their specific discipline and what they are working on. Researchers, if they deposit their datasets at all, frequently have neither the time nor the information management expertise to describe the data sufficiently well for discovery. Research datasets are typically found by reading scholarly journal articles and then leveraging professional networks to obtain a copy, or by turning to web-scale search engines or AI tools to find what has made it to the open web.
At Yale, we envision a different future. This post-doctoral position will investigate the use of knowledge graphs on automatically extracted metadata at a cross-disciplinary global scale. Using LLMs and traditional data engineering techniques, we will discover, characterize, reconcile, enrich and connect the datasets with the subjects, people, institutions, places, research projects, and funding programs. The aim is to facilitate both graph-based analysis of research data as a meta-domain and discovery of relevant datasets through graph search and guided browsing through relationships. Along with Yale-managed datasets, these datasets will also be aligned with scholarly literature, natural history specimens, and art museum collections in Yale’s LUX platform. Just as arXiv, medRxiv, bioRxiv, PsyArXiv, and other disciplinary pre-print servers have been significantly more successful than institutional repositories, we believe the same model can easily be applied to detailed, structured metadata about datasets, while the datasets themselves are hosted in their current repositories.
The primary research goals include:
- Discovery of large-scale dataset repositories, and harvesting metadata from them, including the German NFDI network, the EU’s OpenAIRE, and others listed in the Registry of Research Data Repositories.
- Harmonization and transformation of the metadata into a knowledge graph structure.
- Extraction of entities and reconciliation against cross-domain authorities and meta-datasets, such as Wikidata, the Library of Congress, GeoNames, and beyond.
- Enriching dataset descriptions using the graph, LLM summarization of the raw data, and other techniques to ensure the records are well connected and well described.
- Alignment of natural history datasets at the data level rather than the metadata level.
- Using the LUX paradigm and code to construct a discovery interface, including LLM-powered natural language search.
- Conducting experiments, in collaboration with faculty, students and other researchers, on the effectiveness of the solution in accelerating their data-oriented research tasks.
Mentoring Program:
The post-doc will have direct access to all of the expertise needed for the research to be successful, and can be embedded within the Research Library, the Peabody Museum, or more technical parts of the University for periods of time to learn about both research and operational workflows. Connections with the Wu Tsai Institute, the AI at Yale program, the Data-Intensive Social Science Center and the Yale Center for Research Computing will be leveraged to ensure access to a broad and deep internal network.
By the end of this appointment, the post-doctoral associate will have reinforced and extended their understanding of cutting-edge data transformation, knowledge graph techniques, the research process itself across disciplines, and software engineering techniques for sustainable applications.
Internal Network:
- Dr Robert Sanderson - Primary mentor and supervisor
- Ben Norton - Cultural Heritage Data Engineer, C&SC
- Dr William Mattingly - Cultural Heritage Data Scientist, C&SC
- Dr Rebecca Dikow - Director of Research Innovation, Yale University Libraries
- Dr Gary Motz - Head of Computer Systems, Yale Peabody Museum
- Jeff Campbell - Associate Director for Cultural Heritage Technology, ITS
Required Qualifications:
- Ph.D. in Computer Science, Library/Information Science, Knowledge Graphs / Linked Data in any discipline, or a related field
- Proficiency in Python, and experience with data analysis and transformation
To Apply:
Please send a cover letter, CV/Resumé, and 2-3 references to robert.sanderson@yale.edu, with the subject line: Dataset Discovery Post-Doc Application
Funding Sources:
The William S. Reese ’77 Digital Cultural Heritage Postdoctoral Program
Yale University Library Innovation Fund
Teaching Responsibilities:
None
Compensation and benefits:
https://postdocs.yale.edu/postdocs/being-a-postdoc-at-yale/postdoctoral-compensation
https://postdocs.yale.edu/postdocs/being-a-postdoc-at-yale/benefits-summary
Appointed for one year, renewable for a second year