Word Embeddings for the Software Engineering Domain (MSR 2018 - Data Showcase) - MSR 2018

Mon 28 - Tue 29 May 2018 Gothenburg, Sweden

co-located with ICSE 2018

Who

Vasiliki Efstathiou, Christos Chatzilenas, Diomidis Spinellis

Track

MSR 2018 Data Showcase

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 29 May 2018 11:54 - 12:00 at E3 room - Data Showcase

Abstract

The software development process produces vast amounts of textual data expressed in natural language. Outcomes from the natural language processing community have been adapted in software engineering research to leverage this rich textual information; these include methods and readily available tools, often furnished with pre-trained models. State-of-the-art pre-trained models, however, capture general, common-sense knowledge, which is of limited value when handling data specific to a specialized domain. There is currently a lack of domain-specific pre-trained models that would further enhance the processing of natural language artefacts related to software engineering. To this end, we release a word2vec model trained over 15GB of textual data from Stack Overflow posts. We illustrate how the model disambiguates polysemous words by interpreting them within their software engineering context. In addition, we present examples of fine-grained semantics captured by the model, which imply that these results transfer to diverse, targeted information retrieval tasks in software engineering and motivate further reuse of the model.
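
As a rough illustration of how a word embedding model can place a polysemous word in its software engineering context, the sketch below ranks neighbours by cosine similarity. The vectors, the `TOY_VECTORS` table, and the helper names are invented for this example and are not taken from the released model; with the real model, the vectors would instead be loaded through a word2vec-compatible library.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# They are NOT values from the released Stack Overflow model.
TOY_VECTORS = {
    "python": [0.90, 0.10, 0.00],
    "java":   [0.80, 0.20, 0.10],
    "ruby":   [0.85, 0.15, 0.05],
    "snake":  [0.00, 0.10, 0.90],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity with a zero vector is treated as 0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


if __name__ == "__main__":
    for neighbour, score in most_similar("python", TOY_VECTORS):
        print(f"{neighbour}\t{score:.3f}")
```

In these toy vectors "python" sits close to other programming languages and far from "snake", which mirrors the kind of domain-specific reading the abstract describes; the actual neighbours in the released model may differ.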

Link to Preprint

https://github.com/vefstathiou/SO_word2vec/blob/master/MSR18-w2v.pdf

DOI

https://doi.org/10.1145/3196398.3196448

Vasiliki Efstathiou (Author)

Athens University of Economics and Business

Greece

Christos Chatzilenas (Author)

Diomidis Spinellis (Author)

Athens University of Economics and Business

Greece

Session Program

Tue 29 May
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

	11:00 - 12:30	Data Showcase at E3 room

	11:00 6m Short-paper		50K-C: A dataset of compilable, and compiled, Java projects	Pedro Martins (University of California at Irvine, USA), Crista Lopes (University of California Irvine), Rohan Achar
	11:06 6m Short-paper		JBench: A Dataset of Data Races for Concurrency Testing	Jian Gao (School of Software, Tsinghua University), Xin Yang, Yu Jiang, Han Liu, Weiliang Ying, Xian Zhang
	11:12 6m Short-paper		Bugs.jar: A Large-scale, Diverse Dataset of Real-world Java Bugs	Ripon Saha, Yingjun Lyu (University of Southern California), Wing Lam (University of Illinois at Urbana-Champaign), Hiroaki Yoshida (Fujitsu Laboratories of America, Inc.), Mukul Prasad (Fujitsu Laboratories of America)
	11:18 6m Short-paper		A Gold Standard for Emotion Annotation in Stack Overflow	Nicole Novielli (University of Bari), Fabio Calefato (University of Bari), Filippo Lanubile (University of Bari)
	11:24 6m Short-paper		Vulinoss: A Dataset of Security Vulnerabilities in Open-source Systems	Antonios Gkortzis (Athens University of Economics and Business), Dimitris Mitropoulos, Diomidis Spinellis (Athens University of Economics and Business)
	11:30 6m Short-paper		A Dataset of Duplicate Pull-requests in GitHub	Zhixing Li (College of Computer, National University of Defense Technology, Changsha, China), Yue Yu (National University of Defense Technology), Gang Yin (National University of Defense Technology), Tao Wang (National University of Defense Technology), Huaimin Wang
	11:36 6m Short-paper		Structured Information on State and Evolution of Dockerfiles on GitHub	Gerald Schermann, Sali Zumberi, Jürgen Cito (MIT)
	11:42 6m Short-paper		A Graph-based Dataset of Commit History of Real-World Android apps	Franz-Xaver Geiger, Ivano Malavolta (Vrije Universiteit Amsterdam), Luca Pascarella (Delft University of Technology), Fabio Palomba, Dario Di Nucci (Vrije Universiteit Brussel), Alberto Bacchelli (University of Zurich)
	11:48 6m Short-paper		Public Git Archive: a Big Code dataset for all	Vadim Markovtsev (source{d}), Waren Long (source{d})
	11:54 6m Short-paper		Word Embeddings for the Software Engineering Domain	Vasiliki Efstathiou (Athens University of Economics and Business), Christos Chatzilenas, Diomidis Spinellis (Athens University of Economics and Business)
	12:00 6m Short-paper		npm-miner: An Infrastructure for Measuring the Quality of the npm Registry	Kyriakos Chatzidimitriou (Aristotle University of Thessaloniki), Michail Papamichail, Themistoklis Diamantopoulos (Electrical and Computer Engineering Dept, Aristotle University of Thessaloniki), Michail Tsapanos, Andreas Symeonidis
	12:06 6m Short-paper		CROP: Linking Code Reviews to Source Code Changes	Matheus Paixao (University College London), Jens Krinke (University College London), DongGyun Han (University College London), Mark Harman (Facebook and University College London)
	12:12 6m Short-paper		Developer Interaction Traces backed by IDE Screen Recordings from Think-aloud Sessions	Aiko Yamashita (Oslo Metropolitan University), Fabio Petrillo (Concordia University), Foutse Khomh (Polytechnique Montréal), Yann-Gaël Guéhéneuc (Concordia University and Polytechnique Montréal)
	12:18 6m Short-paper		A Multi-level Dataset of Linux Kernel Patchwork	Yulin Xu (Peking University), Minghui Zhou (Peking University)
	12:24 6m Short-paper		Documented Unix Facilities Over 48 Years	Diomidis Spinellis (Athens University of Economics and Business)