About

SemanGit is the first collection of linked data extracted from GitHub, based on a Git ontology that we designed and extended to include GitHub-specific features. Here we provide this ontology for a semantic representation of the Git protocol and its derivatives (GitHub, Bitbucket etc.), together with cumulative Turtle datasets of publicly accessible data on GitHub. Technical documentation on the scripts can be obtained from the GitHub project page. A short introduction to the ontology can be found under the heading Ontology. A list of our scientific publications about SemanGit is also available.

News

- Updated publications and added new links to the homepage. (27th Oct 2019)
- We are happy to announce that two of our publications will appear soon at ISWC 2019. (21st Sept 2019)
- The August dump is still delayed due to missing new data from GHTorrent. (13th August 2019)
- The June dump was uploaded. (15th June 2019)
- Locations are now encoded as http://dbpedia.org/ontology/city, enabling interlinking with DBpedia
- Fixed faulty generation of Issue IDs

Downloads

You can download either a cumulative dump covering the recent months or the SemanGit scripts for generating one or more dumps yourself. If you need the data of an older dump such as 2016-03-01, please download the Linux script and compute the dump manually.

Dumps
June 2019

Scripts and Instructions
Linux Download Script
A short script that automatically downloads the data and transforms it to RDF. Compatible with all common Linux distributions. SHA256: 13069e7b 30f9f380 35243113 864c99cf fe62e07c bc9e2d3a 2486996c cea71411
Windows/Mac Instructions
Instructions on the manual processing steps required to generate the SemanGit RDF
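
The SHA256 checksum listed for the Linux download script can be compared against a downloaded copy before running it. The following is a minimal Python sketch of such a check, not part of the SemanGit scripts. The function names are illustrative, and the path of the downloaded file is something you supply yourself.

```python
import hashlib
from pathlib import Path

# Digest as published on this page, grouped in blocks of eight hex characters.
PUBLISHED_SHA256 = (
    "13069e7b 30f9f380 35243113 864c99cf "
    "fe62e07c bc9e2d3a 2486996c cea71411"
)


def normalize_digest(text: str) -> str:
    """Remove all whitespace from a hex digest and lowercase it."""
    return "".join(text.split()).lower()


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file chunk by chunk so it never has to fit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, published: str = PUBLISHED_SHA256) -> bool:
    """Return True if the file's SHA256 equals the published digest."""
    return sha256_of_file(path) == normalize_digest(published)
```

Calling matches_published_digest(Path(...)) with the location of your downloaded script returns False if the file differs from the one the checksum was computed for.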

Publications

While it does not count as a publication in the scientific sense, there is a project report on the initial version of the converter, written for module CS-4313 at the University of Bonn.

Notes on the Ontology

As a first step, we created an ontology for Git and extended it to also take the social features of GitHub into account. An interactive visualization of our ontology can be found on WebVOWL. Due to limitations of the visualization, the underlying ontology is not entirely correctly typed there. A full and correct version can be obtained from GitHub.

(Note: To display everything choose: Filter > Degree of Collapsing > 0 )

Because we obtain all our data from GHTorrent, we decided to take their structure into account when developing our ontology. All relevant information provided by them is included in our structure, too.

We took care to clearly distinguish between Git properties and properties that do not belong to the Git protocol but are provided by GitHub. In the latter case, we created a subclass that holds all the additional information and made the basic Git class its superclass. As an example, a "Git user" only consists of an email address. That is to say, when creating a commit, one can specify the author of the commit by providing an email address. A GitHub user, on the other hand, is a lot more complex. It has a nickname, an avatar, a creation date, a location etc. So in our ontology, there is a class called User which only has the user email data field, and a github user subclass that has 9 more fields, including those mentioned above.

On GitHub, users can leave comments on commits, pull requests and reported issues. Since those are very similar, we grouped these three classes as commentable, such that a comment can be made on any class that is a subclass of commentable.

Our ontology comprises 22 classes and 80 properties. During the creation process, we used the WebVOWL editor for building the ontology and the OOPS! (OntOlogy Pitfall Scanner!) tool for checking its integrity.
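
The subclass relationships described above can be illustrated with a small Python sketch. This is only an analogy for the class hierarchy, not the ontology itself: the Python class and field names are assumptions chosen for readability, only the GitHub user fields named in the text are listed, and the Comment fields (target, body) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    """Plain Git user: identified only by an email address."""
    email: str


@dataclass
class GithubUser(User):
    """GitHub user: a Git user plus GitHub-only attributes.

    Only the attributes named in the text appear here; the ontology has more.
    """
    nickname: Optional[str] = None
    avatar: Optional[str] = None
    created_at: Optional[str] = None
    location: Optional[str] = None


class Commentable:
    """Common superclass of everything a comment can be attached to."""


@dataclass
class Commit(Commentable):
    author: User  # a commit only needs the Git-level user (an email)


class PullRequest(Commentable):
    pass


class Issue(Commentable):
    pass


@dataclass
class Comment:
    target: Commentable  # any subclass of Commentable is accepted
    body: str
```

Because GithubUser extends User, anything that expects a Git user also accepts a GitHub user, which mirrors the separation between Git properties and GitHub additions.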

Notes on the Dataset

Our datasets are provided as RDF (Resource Description Framework) data in the Turtle format. Each dump is around 400 GB in size and contains around 20,000,000,000 RDF triples. To achieve a maximum compression rate, we optimized our data by choosing prefixes with a length of at most 2 characters and by re-encoding all integers into a Base64-like representation.


# Unoptimized Data
semangit:ghissue_123456 a semangit:github_issue;
    semangit:github_issue_created_at "2002-05-30T09:00:00"^^xsd:dateTime;
    semangit:github_issue_project semangit:ghrepo_234567;
    semangit:github_issue_assignee semangit:ghuser_345678.

# With prefixing
u:123456 a x:;
    C: "2002-05-30T09:00:00"^^xsd:dateTime;
    y: :234567;
    A: m:345678.

# With Base64 like integer representation
u:x3T a x:;
    C: "2002-05-30T09:00:00"^^xsd:dateTime;
    y: :WR9;
    A: m:af93.

A detailed list of all prefixes used can be found in the first lines of each dump.
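
To illustrate the integer re-encoding, here is a minimal Python sketch of a Base64-like positional encoding. The 64-symbol alphabet below is an assumption made for this example. The alphabet and symbol order actually used in the dumps are not stated here, so the sketch does not reproduce the exact strings shown above (such as x3T); it only shows why the identifiers get shorter.

```python
import string

# Hypothetical alphabet of 64 symbols (assumption, not the SemanGit alphabet).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "-_"
BASE = len(ALPHABET)  # 64


def encode_int(value: int) -> str:
    """Write a non-negative integer in base 64 using ALPHABET as digits."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    if value == 0:
        return ALPHABET[0]
    symbols = []
    while value:
        value, remainder = divmod(value, BASE)
        symbols.append(ALPHABET[remainder])
    return "".join(reversed(symbols))  # most significant digit first


def decode_int(text: str) -> int:
    """Inverse of encode_int."""
    value = 0
    for symbol in text:
        value = value * BASE + ALPHABET.index(symbol)
    return value
```

With any 64-symbol alphabet, the six-digit identifiers 123456 and 234567 shrink to three symbols and 345678 to four, which matches the lengths of the encoded identifiers in the example above.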

Licenses

Papers, posters, reports and especially the WebVOWL interface follow their own respective licenses. For the code we publish on GitHub, and especially for the datasets, the following licenses apply.

Software

Copyright (c) 2019 Matthias Böckmann, Dennis Kubitza
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Datasets
(non-commercial)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, in accordance with the GHTorrent project.

We thank Georgios Gousios for providing the GHTorrent dataset.

The GHTorrent dataset and tool suite
Gousios, Georgios
Proceedings of the 10th Working Conference on Mining Software Repositories, pp. 233-236, 2013

Datasets
(commercial)

For commercial usage of our datasets, only the restrictions of our data provider ghtorrent.com apply. Detailed information and contacts for commercial usage can be found on ghtorrent/faq

Contact

The SemanGit Team:

Damien Graux

damien.graux@adaptcentre.ie

Matthias Böckmann

matthias.boeckmann@iais.fraunhofer.de

Dennis Kubitza

dennis.kubitza@uni-bonn.de
