You can download either a cumulative dump from recent months or the SemanGit scripts for generating one or more dumps yourself. If you need the data of an older dump, such as 2016-03-01, please download the Linux script and generate it manually.
As a first step, we have created an ontology for Git and extended it to also take the social features of GitHub into account. An interactive visualization of our ontology can be found on WebVOWL. Due to typesetting limitations, the underlying ontology is not entirely correctly typed. A full and correct version can be obtained from GitHub.
Because we obtain all our data from GHTorrent, we decided to take its structure into account when developing our ontology. All relevant information provided by GHTorrent is included in our structure as well. We took care to clearly distinguish between Git properties and properties that do not belong to the Git protocol but are provided by GitHub. For the latter, we created a subclass that holds all the additional information and made the basic Git class its superclass.
As an example, a "Git user" consists only of an email address. That is to say, when creating a commit, one can specify the author of the commit by providing an email address. A GitHub user, on the other hand, is a lot more complex: it has a nickname, an avatar, a creation date, a location, and so on. So in our ontology, there is a class called User which only has the user email data field, and a GitHub user subclass that has 9 more fields, including those mentioned above.
On GitHub, users can leave comments on commits, pull requests and reported issues. Since these are very similar, we have grouped the three classes under commentable, so that a comment can be made on any class that is a subclass of commentable. Our ontology comprises 22 classes and 80 properties. During the creation process, we used the WebVOWL editor tool to build the ontology and the OOPS! (OntOlogy Pitfall Scanner!) to check its integrity.
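The subclass pattern described above can be sketched in Python. This is only an illustration of the hierarchy, not the ontology itself: the Python class and field names are assumptions, and only the GitHub user fields named in the text are listed (the ontology has more).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class User:
    """Plain Git user: described only by an email address."""
    email: str


@dataclass
class GithubUser(User):
    """GitHub user: the Git user plus platform-specific fields.

    Field names are illustrative; only fields mentioned in the text appear here.
    """
    nickname: Optional[str] = None
    avatar: Optional[str] = None
    created_at: Optional[str] = None
    location: Optional[str] = None


@dataclass
class Comment:
    """A comment left by a user (hypothetical minimal shape)."""
    author: User
    text: str


@dataclass
class Commentable:
    """Common superclass of everything a comment can be attached to."""
    comments: List[Comment] = field(default_factory=list)

    def add_comment(self, author: User, text: str) -> Comment:
        comment = Comment(author, text)
        self.comments.append(comment)
        return comment


@dataclass
class Commit(Commentable):
    pass


@dataclass
class PullRequest(Commentable):
    pass


@dataclass
class Issue(Commentable):
    pass
```

Because a comment only requires its target to be a Commentable, the same `add_comment` call works for a commit, a pull request or an issue, and any GithubUser can be used wherever a plain User is expected.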
# Unoptimized Data
semangit:ghissue_123456 a semangit:github_issue;
# With prefixing
u:123456 a x:;
# With Base64 like integer representation
u:x3T a x:;
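The last step above replaces the decimal identifier with a shorter, Base64-like one. A minimal sketch of such an encoding follows, assuming a 64-character alphabet chosen here purely for illustration; the alphabet actually used by SemanGit is not stated on this page, so this sketch will not necessarily reproduce the identifier shown above.

```python
# Hypothetical 64-symbol alphabet (an assumption, not SemanGit's actual one).
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_"
BASE = len(ALPHABET)  # 64


def encode_id(n: int) -> str:
    """Write a non-negative integer in base 64 using ALPHABET as digits."""
    if n < 0:
        raise ValueError("identifier must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(digits))


def decode_id(s: str) -> int:
    """Inverse of encode_id: recover the integer from its base-64 string."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

With this alphabet, the six-digit identifier 123456 shrinks to three characters, and `decode_id(encode_id(n)) == n` holds for every non-negative integer, so no information is lost.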
Papers, posters, reports and especially the WebVOWL interface follow their own respective licenses. For the code we publish on GitHub, and especially for the datasets, the following licenses apply:
Copyright (c) 2019 Matthias Böckmann, Dennis Kubitza
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
We thank Georgios Gousios for providing the GHTorrent dataset.
The SemanGit Team: