MSR 2018
Mon 28 - Tue 29 May 2018 Gothenburg, Sweden
co-located with ICSE 2018
Tue 29 May 2018 11:48 - 11:54 at E3 room - Data Showcase

The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential enabled by the available open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive, a dataset of 182,014 top-bookmarked Git repositories from GitHub. We describe the novel data retrieval pipeline to reproduce it. We also elaborate on the strategy for performing dataset updates and on legal issues. The Public Git Archive occupies 3.0 TB on disk and is an order of magnitude larger than the current source code datasets. The dataset is made available through HTTP and provides the source code of the projects, the related metadata, and development history. The data retrieval pipeline employs an optimized worker queue model and an optimized archive format to efficiently store forked Git repositories, reducing the amount of data to download and persist. Public Git Archive aims to open a myriad of new opportunities for “Big Code” research.

Conference Day
Tue 29 May

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

11:00 - 12:30
Data Showcase at E3 room
11:00
6m
Short-paper
50K-C: A dataset of compilable, and compiled, Java projects
Data Showcase
A: Pedro Martins (University of California at Irvine, USA), A: Crista Lopes (University of California Irvine), A: Rohan Achar
11:06
6m
Short-paper
JBench: A Dataset of Data Races for Concurrency Testing
Data Showcase
A: Jian Gao (School of Software, Tsinghua University), A: Xin Yang, A: Yu Jiang, A: Han Liu, A: Weiliang Ying, A: Xian Zhang
11:12
6m
Short-paper
Bugs.jar: A Large-scale, Diverse Dataset of Real-world Java Bugs
Data Showcase
A: Ripon Saha, A: Yingjun Lyu (University of Southern California), A: Wing Lam (University of Illinois at Urbana-Champaign), A: Hiroaki Yoshida (Fujitsu Laboratories of America, Inc.), A: Mukul Prasad (Fujitsu Laboratories of America)
11:18
6m
Short-paper
A Gold Standard for Emotion Annotation in Stack Overflow
Data Showcase
A: Nicole Novielli (University of Bari), A: Fabio Calefato (University of Bari), A: Filippo Lanubile (University of Bari)
11:24
6m
Short-paper
Vulinoss: A Dataset of Security Vulnerabilities in Open-source Systems
Data Showcase
A: Antonios Gkortzis (Athens University of Economics and Business), A: Dimitris Mitropoulos, A: Diomidis Spinellis (Athens University of Economics and Business)
11:30
6m
Short-paper
A Dataset of Duplicate Pull-requests in GitHub
Data Showcase
A: Zhixing Li (College of Computer, National University of Defense Technology, Changsha, China), A: Yue Yu (National University of Defense Technology), A: Gang Yin (National University of Defense Technology), A: Tao Wang (National University of Defense Technology), A: Huaimin Wang
11:36
6m
Short-paper
Structured Information on State and Evolution of Dockerfiles on GitHub
Data Showcase
11:42
6m
Short-paper
A Graph-based Dataset of Commit History of Real-World Android apps
Data Showcase
A: Franz-Xaver Geiger, A: Ivano Malavolta (Vrije Universiteit Amsterdam), A: Luca Pascarella (Delft University of Technology), A: Fabio Palomba, A: Dario Di Nucci (Vrije Universiteit Brussel), A: Alberto Bacchelli (University of Zurich)
11:48
6m
Short-paper
Public Git Archive: a Big Code dataset for all
Data Showcase
A: Vadim Markovtsev (source{d}), A: Waren Long (source{d})
11:54
6m
Short-paper
Word Embeddings for the Software Engineering Domain
Data Showcase
A: Vasiliki Efstathiou (Athens University of Economics and Business), A: Christos Chatzilenas, A: Diomidis Spinellis (Athens University of Economics and Business)
12:00
6m
Short-paper
npm-miner: An Infrastructure for Measuring the Quality of the npm Registry
Data Showcase
A: Kyriakos Chatzidimitriou (Aristotle University of Thessaloniki), A: Michail Papamichail, A: Themistoklis Diamantopoulos (Electrical and Computer Engineering Dept, Aristotle University of Thessaloniki), A: Michail Tsapanos, A: Andreas Symeonidis
12:06
6m
Short-paper
CROP: Linking Code Reviews to Source Code Changes
Data Showcase
A: Matheus Paixao (University College London), A: Jens Krinke (University College London), A: DongGyun Han (University College London), A: Mark Harman (Facebook and University College London)
12:12
6m
Short-paper
Developer Interaction Traces backed by IDE Screen Recordings from Think-aloud Sessions
Data Showcase
A: Aiko Yamashita (Oslo Metropolitan University), A: Fabio Petrillo (Concordia University), A: Foutse Khomh (Polytechnique Montréal), A: Yann-Gaël Guéhéneuc (Concordia University and Polytechnique Montréal)
12:18
6m
Short-paper
A Multi-level Dataset of Linux Kernel Patchwork
Data Showcase
A: Yulin Xu (Peking University), A: Minghui Zhou (Peking University)
12:24
6m
Short-paper
Documented Unix Facilities Over 48 Years
Data Showcase
A: Diomidis Spinellis (Athens University of Economics and Business)