SemanGit

About


SemanGit is the first collection of linked data extracted from GitHub, based on a Git ontology that we designed and extended to cover GitHub-specific features. Here we provide this ontology for a semantic representation of the Git protocol and its derivatives (GitHub, Bitbucket etc.), together with cumulative Turtle datasets of publicly accessible GitHub data. Technical documentation on the scripts can be obtained from the GitHub Project Page. A short introduction to the ontology can be found under the heading Ontology. In addition, a list of our scientific works about SemanGit can be found under Publications.

News


Downloads


You can download either a cumulative dump from recent months or the SemanGit scripts for generating one or more dumps yourself. If you need the data of an older dump, such as 2016-03-01, please download the Linux script and compute it manually.

  • Scripts and Instructions

  • Linux Download Script


    A short script which automatically downloads the data and transforms it to RDF. Compatible with all common Linux distributions. SHA256: 13069e7b 30f9f380 35243113 864c99cf fe62e07c bc9e2d3a 2486996c cea71411
  • Windows/Mac Instructions


    Instructions on the manual processing steps required to generate the SemanGit RDF
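
The SHA256 value listed for the Linux download script can be compared against the downloaded file before running it. The following is a minimal sketch of such a check; the file name `semangit_download.sh` is a hypothetical placeholder, and the checksum is copied from the list above with its visual spacing removed.

```python
import hashlib
from pathlib import Path

# Checksum as published on this page; the spaces are only visual grouping.
PUBLISHED_SHA256 = (
    "13069e7b 30f9f380 35243113 864c99cf fe62e07c bc9e2d3a 2486996c cea71411"
)


def normalize_digest(text: str) -> str:
    """Strip all whitespace and lowercase a hex digest for comparison."""
    return "".join(text.split()).lower()


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, published: str = PUBLISHED_SHA256) -> bool:
    """True if the file's digest equals the published checksum."""
    return sha256_of_file(path) == normalize_digest(published)


if __name__ == "__main__":
    # Hypothetical file name; use whatever name the script was saved under.
    script = Path("semangit_download.sh")
    if script.exists():
        print("checksum OK" if matches_published(script) else "checksum MISMATCH")
    else:
        print(f"{script} not found")
```

If the check reports a mismatch, the file was either corrupted during download or is not the version the checksum refers to.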

Publications

Although it does not count as a publication in the scientific sense, there is a project report for the initial version of the converter, written for module CS-4313 at the University of Bonn.

Notes on the Ontology


As a first step, we have created an ontology for Git and extended it to also take the social features of GitHub into account. An interactive visualization of our ontology can be found on WebVOWL. Due to typesetting limitations, the underlying ontology is not entirely correctly typed. A full and correct version can be obtained from GitHub.


(Note: To display everything choose: Filter > Degree of Collapsing > 0 )

Because we obtain all our data from GHTorrent, we took its structure into account when developing our ontology. All relevant information it provides is included in our structure, too. We took care to clearly distinguish between Git properties and properties that do not belong to the Git protocol but are provided by GitHub. In the latter case, we created a subclass that holds all the additional information and made the basic Git class its superclass.

As an example, a "Git user" consists only of an email address. That is to say, when creating a commit, one can specify the author of the commit by providing an email address. A GitHub user, on the other hand, is a lot more complex: it has a nickname, an avatar, a creation date, a location etc. So in our ontology there is a class called User, which only has the user email data field, and a GitHub user subclass that has 9 more fields, including those mentioned above.

On GitHub, users can leave comments on commits, pull requests and reported issues. Since those are very similar, we have grouped these three classes as commentable, such that a comment can be made on any class that is a subclass of commentable. Our ontology comprises 22 classes and 80 properties. During the creation process, we used the WebVOWL editor for building the ontology and OOPS! (OntOlogy Pitfall Scanner!) for checking its integrity.
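
The class layout described above can be pictured with a small Python sketch. This is only an illustration of the subclassing idea, not the ontology itself: all class and field names are hypothetical, and only the four GitHub user fields named in the text are shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    """Plain Git user: identified only by an email address."""
    email: str


@dataclass
class GitHubUser(User):
    """Git user plus GitHub-only fields (hypothetical names, partial list)."""
    nickname: Optional[str] = None
    avatar_url: Optional[str] = None
    created_at: Optional[str] = None
    location: Optional[str] = None


class Commentable:
    """Marker superclass for everything a comment can be attached to."""


@dataclass
class Commit(Commentable):
    sha: str
    author: User


@dataclass
class PullRequest(Commentable):
    number: int


@dataclass
class Issue(Commentable):
    number: int


@dataclass
class Comment:
    """A comment may target any subclass of Commentable."""
    author: GitHubUser
    target: Commentable
    body: str


if __name__ == "__main__":
    alice = GitHubUser(email="alice@example.org", nickname="alice")
    commit = Commit(sha="abc123", author=alice)
    note = Comment(author=alice, target=commit, body="Looks good")
    print(isinstance(note.target, Commentable))
```

Because `GitHubUser` extends `User`, anything that accepts a Git user also accepts a GitHub user, which mirrors the superclass relation described in the text.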

Notes on the Dataset


Our datasets are provided in the Resource Description Framework (RDF), serialized in the Turtle format, with a size of around 400 GB per dump and around 20 billion stored RDF triples. To achieve a maximum compression rate, we optimized our data by choosing prefixes of at most 2 characters and re-encoding all integers into a Base64-like representation.

# Unoptimized Data
semangit:ghissue_123456 a semangit:github_issue;
semangit:github_issue_created_at "2002-05-30T09:00:00"^^xsd:dateTime;
semangit:github_issue_project semangit:ghrepo_234567;
semangit:github_issue_assignee semangit:ghuser_345678.

# With prefixing
u:123456 a x:;
C: "2002-05-30T09:00:00"^^xsd:dateTime;
y: :234567;
A: m:345678.

# With Base64 like integer representation
u:x3T a x:;
C: "2002-05-30T09:00:00"^^xsd:dateTime;
y: :WR9;
A: m:af93.
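
The integer re-encoding can be sketched as an ordinary positional conversion to base 64. The 64-symbol alphabet below is an assumption made for illustration; the alphabet actually used in the dumps is not stated here, so the strings this sketch produces will not necessarily match the identifiers in the example above.

```python
# Assumed alphabet (digits, upper case, lower case, "-" and "_").
# The alphabet used in the real dumps may differ.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_"
BASE = len(ALPHABET)  # 64


def encode_int(value: int) -> str:
    """Encode a non-negative integer as a base-64 positional string."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value:
        value, remainder = divmod(value, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_int(text: str) -> int:
    """Invert encode_int for strings over the same alphabet."""
    value = 0
    for symbol in text:
        value = value * BASE + ALPHABET.index(symbol)
    return value


if __name__ == "__main__":
    for number in (123456, 234567, 345678):
        encoded = encode_int(number)
        print(number, encoded, len(encoded), decode_int(encoded) == number)
```

With any 64-symbol alphabet, the six-digit decimal identifiers from the example shrink to three or four symbols, which is where the saving in the last variant comes from.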

A detailed list of all prefixes used can be found in the first lines of each dump.
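
Since the prefix declarations sit at the top of each dump, they can be read without loading the whole file. The sketch below assumes the declarations use the standard Turtle `@prefix name: <iri> .` form, one per line; the IRIs in the sample are placeholders, not the real SemanGit namespaces.

```python
import re
from typing import Dict, Iterable

# Matches one Turtle prefix declaration per line: @prefix name: <iri> .
PREFIX_LINE = re.compile(r"^\s*@prefix\s+([^\s:]*):\s*<([^>]*)>\s*\.\s*$")


def read_prefixes(lines: Iterable[str]) -> Dict[str, str]:
    """Collect leading @prefix declarations and stop at the first data line."""
    prefixes: Dict[str, str] = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments in the header
        match = PREFIX_LINE.match(line)
        if match is None:
            break  # first non-prefix line: the header is over
        prefixes[match.group(1)] = match.group(2)
    return prefixes


if __name__ == "__main__":
    # Placeholder header; real dumps declare their own namespaces.
    sample = [
        "@prefix u: <http://example.org/issue/> .\n",
        "@prefix : <http://example.org/repo/> .\n",
        "u:x3T a x: .\n",
    ]
    for name, iri in read_prefixes(sample).items():
        print(repr(name), iri)
```

Passing an open file object instead of a list works the same way, and because iteration stops at the first data line, only the header of a large dump is actually read.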

Archive


Licenses

Papers, posters, reports and especially the WebVOWL interface follow their own respective licenses. For the code we publish on GitHub, and especially for the datasets, the following licenses apply:


Software

Copyright (c) 2019 Matthias Böckmann, Dennis Kubitza
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Datasets
(non-commercial)

Creative Commons License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, in accordance with the GHTorrent project.


We thank Georgios Gousios for providing the GHTorrent dataset.

  • The GHTorrent dataset and tool suite
    Gousios, Georgios
    Proceedings of the 10th Working Conference on Mining Software Repositories, pp. 233-236. 2013

  • Datasets
    (commercial)

    For commercial use of our datasets, only the restrictions imposed by our data provider ghtorrent.com apply. Detailed information and contacts for commercial use can be found on ghtorrent/faq

    Contact


    The SemanGit Team:

    Damien Graux

    damien.graux@adaptcentre.ie

    Matthias Böckmann

    matthias.boeckmann@iais.fraunhofer.de

    Dennis Kubitza

    dennis.kubitza@uni-bonn.de
