Word Embeddings for the Software Engineering Domain
The software development process produces vast amounts of textual data expressed in natural language. Outcomes from the natural language processing community have been adapted in software engineering research for leveraging this rich textual information; these include methods and readily available tools, often furnished with pre–trained models. State of the art pre–trained models however, capture general, common sense knowledge, with limited value when it comes to handling data specific to a specialized domain. There is currently a lack of domain-specific pre–trained models that would further enhance the processing of natural language artefacts related to software engineering. To this end, we release a word2vec model trained over 15GB of textual data from Stack Overflow posts. We illustrate how the model disambiguates polysemous words by interpreting them within their software engineering context. In addition, we present examples of fine-grained semantics captured by the model, that imply transferability of these results to diverse, targeted information retrieval tasks in software engineering and motivate for further reuse of the model.
Tue 29 MayDisplayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna change
11:00 - 12:30 | |||
11:00 6mShort-paper | 50K-C: A dataset of compilable, and compiled, Java projects Data Showcase A: Pedro Martins University of California at Irvine, USA, A: Crista Lopes University of California Irvine, A: Rohan Achar | ||
11:06 6mShort-paper | JBench: A Dataset of Data Races for Concurrency Testing Data Showcase A: Jian Gao School of Software, Tsinghua University, A: Xin Yang , A: Yu Jiang , A: Han Liu , A: Weiliang Ying , A: Xian Zhang | ||
11:12 6mShort-paper | Bugs.jar: A Large-scale, Diverse Dataset of Real-world Java Bugs Data Showcase A: Ripon Saha , A: Yingjun Lyu University of Southern California, A: Wing Lam University of Illinois at Urbana-Champaign, A: Hiroaki Yoshida Fujitsu Laboratories of America, Inc., A: Mukul Prasad Fujitsu Laboratories of America | ||
11:18 6mShort-paper | A Gold Standard for Emotion Annotation in Stack Overflow Data Showcase A: Nicole Novielli University of Bari, A: Fabio Calefato University of Bari, A: Filippo Lanubile University of Bari Pre-print | ||
11:24 6mShort-paper | Vulinoss: A Dataset of Security Vulnerabilities in Open-source Systems Data Showcase A: Antonios Gkortzis Athens University of Economics and Business, A: Dimitris Mitropoulos , A: Diomidis Spinellis Athens University of Economics and Business Pre-print | ||
11:30 6mShort-paper | A Dataset of Duplicate Pull-requests in GitHub Data Showcase A: Zhixing Li College of Computer, National University of Defense Technology, Changsha, China, A: Yue Yu National University of Defense Technology, A: Gang Yin National University of Defense Technology, A: Tao Wang National University of Defense Technology, A: Huaimin Wang Pre-print | ||
11:36 6mShort-paper | Structured Information on State and Evolution of Dockerfiles on GitHub Data Showcase DOI Pre-print | ||
11:42 6mShort-paper | A Graph-based Dataset of Commit History of Real-World Android apps Data Showcase A: Franz-Xaver Geiger , A: Ivano Malavolta Vrije Universiteit Amsterdam, A: Luca Pascarella Delft University of Technology, A: Fabio Palomba , A: Dario Di Nucci Vrije Universiteit Brussel, A: Alberto Bacchelli University of Zurich DOI Pre-print | ||
11:48 6mShort-paper | Public Git Archive: a Big Code dataset for all Data Showcase DOI Pre-print | ||
11:54 6mShort-paper | Word Embeddings for the Software Engineering Domain Data Showcase A: Vasiliki Efstathiou Athens University of Economics and Business, A: Christos Chatzilenas , A: Diomidis Spinellis Athens University of Economics and Business DOI Pre-print | ||
12:00 6mShort-paper | npm-miner: An Infrastructure for Measuring the Quality of the npm Registry Data Showcase A: Kyriakos Chatzidimitriou Aristotle University of Thessaloniki, A: Michail Papamichail , A: Themistoklis Diamantopoulos Electrical and Computer Engineering Dept, Aristotle University of Thessaloniki, A: Michail Tsapanos , A: Andreas Symeonidis DOI Pre-print | ||
12:06 6mShort-paper | CROP: Linking Code Reviews to Source Code Changes Data Showcase A: Matheus Paixao University College London, A: Jens Krinke University College London, A: DongGyun Han University College London, A: Mark Harman Facebook and University College London DOI Pre-print | ||
12:12 6mShort-paper | Developer Interaction Traces backed by IDE Screen Recordings from Think-aloud Sessions Data Showcase A: Aiko Yamashita Oslo Metropolitan University, A: Fabio Petrillo Concordia University, A: Foutse Khomh Polytechnique Montréal, A: Yann-Gaël Guéhéneuc Concordia University and Polytechnique Montréal Pre-print | ||
12:18 6mShort-paper | A Multi-level Dataset of Linux Kernel Patchwork Data Showcase DOI Pre-print | ||
12:24 6mShort-paper | Documented Unix Facilities Over 48 Years Data Showcase Link to publication DOI Media Attached |