Source Code Authorship Identification Using Tokenization and Boosting Algorithms

Gorshkov, S.; Nered, M.; Ilyushin, E.; Namiot, D.; Sukhomlin, V.

Authors: Gorshkov Sergey, Nered Maxim, Ilyushin Eugene, Namiot Dmitry, Sukhomlin Vladimir
Journal: Communications in Computer and Information Science
Volume: 1201
Year of publication: 2020
First page: 295
Last page: 308
DOI: 10.1007/978-3-030-46895-8_23
Abstract: Each programmer has a unique coding style. Source code authorship identification addresses the problem of determining the most likely creator of a piece of source code, in particular for resolving plagiarism cases and disputes about intellectual property violations, as well as for helping to find the creators of malware. Extracting a unique style also helps to maintain the uniformity of code in repositories, taking into account the differing influence of individual programmers. Existing methods include approaches based on random forests and abstract syntax trees, short n-grams for structure preservation combined with a Bayes classifier, and others. We present a new model, called StyleIndex, based on tokenization, on tools for analyzing the semantics of programming languages and the context of tokens in the program text, and on the extraction of a unique author style index. The algorithm applies to various programming languages and shows very high classification accuracy. Moreover, the algorithm is able not only to match source code to its creator when examples of that creator's programs are available for training, but also to divide programs into groups by their presumed authors after being trained on other authors, thereby extracting the components that define style as a global concept, independent of specific authors. The main factors that determine style are also identified.
Added to the system by: Namiot Dmitry Evgenievich
