Understanding the history of a collection item, whether a painting in an art museum, a specimen in a natural history collection, or a set of documents in an archive, is an essential part of responsible stewardship of that object. Unlike digital files in asset management systems, physical objects quickly lose their record of ownership and location unless there is active work to preserve a paper trail. Heritage organizations do maintain information about their accessioned collections, but they do so primarily through handwritten notes, letters and other correspondence, clippings from newspapers or auction catalogs, and printed documents kept in folders. These auxiliary files are both extensive and hard to use, as they require the researcher to be physically present and to know where to look. They also show only a small part of the object’s history: significant events such as exhibitions or conservation processes are often recorded in different systems, or by different organizations.
This research will explore the extent to which AI can accelerate provenance research, in particular through large language model (LLM) based transcription of printed and handwritten documents, followed by extraction from the transcribed text of knowledge about the people, places, organizations and events associated with the object. Both humans and software should then be able to analyze this information, rapidly giving provenance researchers insight into the collections. The approach will be validated through real-world use cases, working with provenance experts across Yale’s collections.
The primary research goals include:
- Understand the extent of existing digital images of text, and how the information locked away in them would advance provenance understanding
- Acquire additional external data, such as the Getty Provenance Index, and any images of content from which that data was derived
- Extract text from internal and external provenance documents with LLM-powered OCR/HTR
- Extract named entities from the text, and align with external knowledge sources such as Wikipedia/Wikidata, the Getty, Library of Congress, National Libraries, and beyond
- Generate structured data descriptions of the objects from the knowledge in the images and extracted text, and align them with existing collection metadata
- Align extracted entities with existing entities/records, including the objects themselves, to bootstrap knowledge graph creation
- Extract relationships between the entities discovered and form an internal knowledge graph and timeline of events across the full corpus
- Build, potentially using AI code assistants, a user interface through which provenance researchers can interact with the knowledge
- Conduct experiments in collaboration with provenance researchers to determine the accuracy, functionality and utility of the system
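As a rough illustration of the later goals above (structured descriptions, an internal knowledge graph, and a timeline of events), the following is a minimal Python sketch. It assumes that transcription and entity extraction have already produced simple event records. The `ProvenanceEvent` class, its field names, the predicate naming, and the sample data are all hypothetical and are not part of any existing Yale system or of the project design.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    """One event extracted from a transcribed document (all fields hypothetical)."""
    object_id: str   # identifier of the collection object
    event_type: str  # e.g. "sale", "exhibition", "conservation"
    actor: str       # person or organization involved
    place: str       # where the event happened
    year: int        # year of the event
    source: str      # document the event was extracted from


def build_timelines(events):
    """Group events by object and sort each group chronologically."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event.object_id].append(event)
    for object_events in timelines.values():
        object_events.sort(key=lambda e: e.year)
    return dict(timelines)


def build_graph(events):
    """Return (subject, predicate, object) triples linking objects, actors and places."""
    triples = set()
    for e in events:
        triples.add((e.object_id, f"{e.event_type}_involving", e.actor))
        triples.add((e.object_id, f"{e.event_type}_at", e.place))
    return triples


# Invented sample data, for illustration only.
events = [
    ProvenanceEvent("obj-1", "exhibition", "Example Gallery", "London", 1931, "doc-b"),
    ProvenanceEvent("obj-1", "sale", "Example Dealer", "Paris", 1902, "doc-a"),
    ProvenanceEvent("obj-2", "sale", "Example Dealer", "Paris", 1910, "doc-c"),
]

timelines = build_timelines(events)
graph = build_graph(events)

# Print the chronological history of one object.
for event in timelines["obj-1"]:
    print(event.year, event.event_type, event.actor, event.place)
```

Keeping the `source` field on every event is a deliberate choice in this sketch, so that each statement in the graph or timeline can be traced back to the document it came from and checked by a provenance researcher.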
Mentoring Program:
Yale is uniquely positioned to support this research, with significant collections, art historical experts, technical expertise and connections to related endeavors and communities. The postdoc will have direct access to all of the expertise needed for the research to be successful, and can be embedded within museums or more technical parts of the University for periods of time to learn about both research and operational workflows.
By the end of the appointment, the post-doctoral associate will have reinforced and extended their understanding of cutting-edge natural language processing, knowledge graph techniques, provenance and art historical understanding, museum and archival processing, fine-tuning of language models and other AI models, data transformation, and software engineering techniques for sustainable applications. While this position is based in the Collections and Scholarly Communication division (Yale’s Libraries, Archives and Museums), there are substantial opportunities for interaction and collaboration with faculty and other scholars, both at Yale and beyond.
Internal Network:
- Dr Robert Sanderson – Primary mentor and supervisor
- Dr William Mattingly – Secondary mentor, assistance with LLM processing
- Dr Agnete Lassen (YPM) and Antonia Bartoli (YUAG) – chairs of the Collections Provenance Working Group
- Yer Vang-Cohen – Head of IT at YUAG
External Network:
- Prof Dr Lynn Rother, Leuphana University (Germany) and Museum of Modern Art (NY)
- Prof Michelle Fabiani, University of New Haven, CURIA Lab Director
- Dr Sandra van Ginhoven, Getty, Head of the Getty Provenance Index (GPI) program
- Kelly Davis, Head of Collections IT, Rijksmuseum, NL (previously Yale, and Getty GPI)
Required Qualifications:
- Ph.D. in Digital Humanities, Computer Science, Library/Information Science, Cultural Heritage Data, or a related field
- Proficiency in Python, and experience with data analysis
To Apply:
Please send a cover letter, CV/Resumé, and 2-3 references to robert.sanderson@yale.edu, with the subject line: Provenance Post-Doc Application
Funding Sources:
The William S. Reese ’77 Digital Cultural Heritage Postdoctoral Program
Teaching Responsibilities:
None
Compensation and Benefits:
https://postdocs.yale.edu/postdocs/being-a-postdoc-at-yale/postdoctoral-compensation
https://postdocs.yale.edu/postdocs/being-a-postdoc-at-yale/benefits-summary