Creating subsets of Wikidata
10-30, 20:00–20:10 (UTC), Room 1

Many projects want to reuse a subset of Wikidata focused on project-specific topics and combine it with their own project-specific data. Extracting such a subgraph is hard for two reasons: Wikidata is very large, and it is difficult to specify what to keep and what to discard. This talk presents the Knowledge Graph Toolkit (KGTK, https://github.com/usc-isi-i2/kgtk), a toolkit that can process the full Wikidata on a laptop and provides a rich suite of query and path-following commands that can be used to flexibly extract topic-specific subgraphs.


Link to notes

https://etherpad.wikimedia.org/p/WikidataCon2021-CreatingsubsetsofWikidata

What will the participants take away from this session?

Participants will learn about KGTK, a sophisticated toolkit that provides commands to query Wikidata, perform network analytics and create graph embeddings. KGTK is scalable and efficient: KGTK queries run many times faster on a laptop than equivalent SPARQL queries running on a large server. Participants will get a preview of how KGTK can be used to create topic-oriented subgraphs of Wikidata.

Language

English

Recording

Yes

Dr. Pedro Szekely is a Principal Scientist and Director of the AI division at the USC Information Sciences Institute (ISI). His research focuses on table understanding, knowledge graphs and applications of knowledge graphs. He teaches a graduate course at USC on Building Knowledge Graphs, and has given tutorials on knowledge graph construction at KDD, ISWC, AAAI and WWW.