KAIST-Elice Data Scientist Edu-Challenge Review


This online challenge, run by KAIST and Elice, started in December 2017 and lasted about one month. Participants ranged from college students who were learning data structures for the first time to expert data scientists working in the field.


The challenge had 4 rounds: the 1st round took place in the week of December 4th and the last round in the week of December 26th. On January 13th, participants attended a workshop consisting of an award ceremony and a networking session. The competition among the many talented participants was strong. At the end of the 4 rounds, 7 outstanding individuals received prizes based on their performance: 1 grand prize, 2 gold prizes, and 4 silver prizes. Participants who showed significant enthusiasm for the challenge were given a hard-worker prize.

The challenge was structured so that the problems could be solved with basic, elementary-level algorithms. Participants who used more complex models were given bonus points. Participants were required to work with real-world data rather than synthetic data. The challenge was run in several parts, each on a different topic that participants were likely to find interesting. The following introduces the purpose of each round's problem and the solution method. The purposes were presented by Elice, and the solution methods were presented by top participants from each round.


Before the problem solution session, a full-time data scientist gave a presentation. He was not originally a data scientist; after receiving education through Elice, he changed careers and now works at Satrec Interactive as a server developer and data scientist. He took a 16-week class at Elice and worked on a project in which he built a chatbot that mimics the speech patterns of a member of the National Assembly.

The dataset was the National Assembly conference log from July 2nd, 2012 to May 19th, 2016. Several models were proposed: models divided by topic, a conservative/progressive party model, and a model using the whole sentence plus additional information. The chatbot was implemented using LSM, with a UI provided on the web and through a KakaoTalk Plus Friend account. Through this education process, he built data manipulation skills and learned how to obtain data by crawling. He also improved his problem-solving skills in deciding which model suits which situation, and developed system design and implementation skills. The presenter commented that, thanks to the education he received from Elice, he was able to get in touch with various companies through recruiting information and now works as a data scientist.


P1. Emotional Analysis of Naver Film Scores

The first problem was to identify the sentiment expressed in 140-word short comments from the Naver Film service. Performance was measured as the accuracy of predicting the star rating attached to each short comment. The following is the purpose of the problem.

“Classic NLP”

Seed word expansion, Naive Bayes, WordNet, BOW+SVM, pLSI/Topic Modeling, x-level Word Embedding.

The following is the solution suggested by the presenter.

To begin with, the presenter built baseline models with Scikit-learn, trying Naive Bayes, Random Forest, and SVM, which scored approximately 60 points. Next, a baseline deep learning model was built with PyTorch: an RNN with a bidirectional GRU scored approximately 70 points. During tuning, a CRNN model that places a Conv layer in front of the RNN reached a similar level of performance. Adding attention scored 71 points, and GRU turned out to perform better than BiLSTM. To get a larger dataset, Naver Film comments were crawled and about 3,000,000 of them were used, with BiGRU scoring 86 points and BiGRU + Attention scoring 88 points.
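As a rough illustration of the kind of bag-of-words Naive Bayes baseline mentioned above, here is a minimal sketch in plain Python. It is not the presenter's code: the presenter used Scikit-learn, while this version is written from scratch, and the toy English comments, the "pos"/"neg" labels, and the whitespace tokenizer are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count documents per class and words per class (bag of words)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # assumption: whitespace tokenization
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return correct / len(labels)

# Hypothetical toy data standing in for short film comments.
train_docs = ["great fun movie", "really great acting",
              "boring and slow", "terrible boring plot"]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "great acting"))
```

For Korean comments, the whitespace split would likely need to be replaced by a morphological or syllable-level tokenizer.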


P2. Predicting the Results of League of Legends Pro-Gamer Competition

The second problem provided information on each team and each team member from 38,679 records of EU Master Daejeon data (2016). This round had the most participants, as the game featured in the problem is quite popular. Performance was measured as the accuracy of the predicted result for each team. The following is the purpose of the problem.

“Feature Engineering”

Normalization, Linear/Logistic Regression, Decision tree/Regression tree, SVC/SVR, Factor Analysis, Probabilistic Model, DNN.

The solution proposed by an outstanding participant first used only game champion information with RF, XGB, and DNN, and scored about 85 points. Second, the participant used summoner information with data augmentation, which also scored approximately 85 points. Finally, combining champion information and summoner information scored 85 points. Contrary to expectations, combining the two types of information did not improve performance.
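The source does not say how the champion information was turned into model inputs. One common way to do this kind of feature engineering, shown here only as a sketch under that assumption, is to encode each match as a vector with one slot per champion. The champion roster, the team names "blue" and "red", and the function name `encode_match` are hypothetical.

```python
def encode_match(blue_picks, red_picks, champion_index):
    """Encode one match as a vector: +1 for blue picks, -1 for red picks."""
    vec = [0] * len(champion_index)
    for name in blue_picks:
        vec[champion_index[name]] = 1
    for name in red_picks:
        vec[champion_index[name]] = -1
    return vec

# Hypothetical, shortened champion roster for illustration only.
champions = ["Ahri", "Garen", "Jinx", "Lux", "Thresh", "Zed"]
champion_index = {name: i for i, name in enumerate(champions)}

features = encode_match(["Ahri", "Jinx"], ["Garen", "Zed"], champion_index)
print(features)
```

A vector like this can be fed to a tree ensemble or a DNN, and summoner-level statistics could be appended as extra columns when the two kinds of information are combined.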


P3. Deducing the Name of Press Company Based on News Article Text

Recently, there has been an increase in fake news, and existing press companies differ slightly in connotation and word choice. Rather than analyzing the writing style of news articles as such, participants were asked to analyze the given data and identify the corresponding press company. They were required to rank the candidate press companies by probability. Performance was evaluated with MRR (Mean Reciprocal Rank).
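For readers unfamiliar with MRR, the following is a minimal sketch of the standard definition: for each article, take the reciprocal of the rank at which the true press company appears in the predicted ranking, then average over all articles. The press names "A", "B", and "C" are placeholders, and treating a missing true label as contributing 0 is an assumption of this sketch.

```python
def mean_reciprocal_rank(ranked_predictions, true_labels):
    """Average of 1/rank of the true label across all examples."""
    total = 0.0
    for ranking, truth in zip(ranked_predictions, true_labels):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)  # ranks start at 1
        # assumption: a true label missing from the ranking adds 0
    return total / len(true_labels)

# Placeholder rankings, most probable press company first.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truths = ["A", "A", "A"]
score = mean_reciprocal_rank(rankings, truths)
print(round(score, 4))
```

Here the true label is ranked 1st, 2nd, and 3rd respectively, so the score is (1 + 1/2 + 1/3) / 3.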

“Text Classification”

Feature Engineering (text), Word Count, Bag of Words, SVC, Topic Models, Ideal Point Model (in politics), x-level Word Embedding.

The best solution of the 3rd round is as follows. To deduce the press company, the solution analyzed style, word choice, format, and tone, focusing on differences in the text layout used for interviews and for posted news articles. First, the solution learns at the syllable level (the position and order of the syllables). Special characters were preserved and used in the learning process. A CNN or RNN was applied to the syllables. A syllable dictionary was built by mapping each syllable in the dataset to an integer. The input length for training was set to 900; shorter inputs were filled with padding. A model was then built and used to produce predictions. The result was 94 points, and the combination of CNN, syllable embedding, and Dropout seemed to contribute significantly. Performance was higher than when the input length was set to 500.
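The syllable dictionary and padding steps described above can be sketched as follows. This is an illustration and not the participant's code: reserving id 0 for padding and id 1 for unseen syllables, and truncating inputs longer than the maximum length, are assumptions, as are the sample strings.

```python
PAD = 0  # assumption: id reserved for padding
UNK = 1  # assumption: id reserved for syllables not in the dictionary

def build_syllable_dict(texts):
    """Map every distinct character (syllable or special character) to an integer."""
    mapping = {}
    for text in texts:
        for ch in text:
            if ch not in mapping:
                mapping[ch] = len(mapping) + 2  # ids 0 and 1 are reserved
    return mapping

def encode(text, mapping, max_len=900):
    """Convert text to a fixed-length list of ids, padding short inputs."""
    ids = [mapping.get(ch, UNK) for ch in text[:max_len]]  # truncate long inputs
    return ids + [PAD] * (max_len - len(ids))

# Hypothetical sample texts; special characters are kept on purpose.
texts = ["[속보] 뉴스", "뉴스입니다."]
mapping = build_syllable_dict(texts)
short = encode("뉴스!", mapping, max_len=5)
print(short)
```

The resulting integer sequences are what an embedding layer followed by a CNN or RNN would consume.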


P4. Predicting the Future Stock Price through Graphs

The participants were given a stock graph without the name of the business, the number of employees, the value of the stock, or the business area. The problem was to predict the price of the stock solely from the graph. Participants were required to analyze the stock graph from January 30th, 2010 to December 30th, 2016 and predict the closing price of December 31st, 2016. Performance evaluation assigned 50 points to the direction of the closing price (rise/fall) and 50 points to the actual closing price (MSE).
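The two evaluation components can be sketched as below. The source does not say how each component was converted into its 50 points, so this sketch only computes the raw quantities; treating an unchanged price as "not a rise" and the sample numbers are assumptions.

```python
def direction_accuracy(last_closes, predicted, actual):
    """Fraction of stocks whose predicted rise/fall matches the actual one."""
    hits = 0
    for last, pred, act in zip(last_closes, predicted, actual):
        # assumption: an unchanged price counts as "not a rise"
        if (pred > last) == (act > last):
            hits += 1
    return hits / len(actual)

def mean_squared_error(predicted, actual):
    """Average squared difference between predicted and actual closes."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical values for three stocks.
last_closes = [100.0, 50.0, 20.0]
predicted = [105.0, 48.0, 21.0]
actual = [103.0, 51.0, 22.0]

print(direction_accuracy(last_closes, predicted, actual))
print(mean_squared_error(predicted, actual))
```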

Just print the last price, Moving average, Momentum, Black-Scholes, Thousands of Heuristics and Theories, Bayesian Probabilistic Models, RNN.
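The three simplest approaches in this list can be sketched in a few lines. These are generic textbook baselines, not code from the challenge; the window size, the lookback, and the interpretation of momentum as "repeat the most recent change" are assumptions.

```python
def last_price(closes):
    """Predict that tomorrow's close equals today's close."""
    return closes[-1]

def moving_average(closes, window=5):
    """Predict the mean of the most recent `window` closes."""
    recent = closes[-window:]
    return sum(recent) / len(recent)

def momentum(closes, lookback=1):
    """Predict that the change over the last `lookback` days repeats once more."""
    return closes[-1] + (closes[-1] - closes[-1 - lookback])

# Hypothetical closing prices, oldest first.
closes = [10.0, 11.0, 12.0, 13.0, 14.0, 16.0]
print(last_price(closes), moving_average(closes), momentum(closes))
```

Baselines like these give a reference point against which a more complex model, such as an RNN, can be judged.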

The stock price prediction solution computed its performance on data in the training data format. Performance was measured as MSE (mean squared error). In data processing, the volume value was transformed and a close_diff variable was added, derived from the closing value and the maximum/minimum change range.

X: high, low, open, close, volume, close_diff (6 vars)

Y (target): next-day close for X

Do not distinguish date, symbol.

train: date <= '2016-12-29'

test: date == '2016-12-29'
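The pairing of each day's features with the next day's close can be sketched as follows. This is an assumption-laden illustration: the row layout (dictionaries with "symbol" and "date" keys) is hypothetical, close_diff is left out because its exact definition is not given, and rows are grouped by symbol only to find each next-day close before all pairs are pooled into one training set.

```python
from collections import defaultdict

# close_diff is omitted here because the source does not define it precisely.
FEATURES = ("high", "low", "open", "close", "volume")

def make_pairs(rows):
    """Build (X, y) where y is the next trading day's close for the same symbol."""
    by_symbol = defaultdict(list)
    for row in rows:
        by_symbol[row["symbol"]].append(row)
    X, y = [], []
    for series in by_symbol.values():
        series.sort(key=lambda r: r["date"])  # ISO dates sort chronologically
        for today, tomorrow in zip(series, series[1:]):
            X.append([today[name] for name in FEATURES])
            y.append(tomorrow["close"])
    return X, y

# Hypothetical rows for two anonymous symbols.
rows = [
    {"symbol": "A", "date": "2016-12-27", "high": 11, "low": 9, "open": 10, "close": 10, "volume": 100},
    {"symbol": "A", "date": "2016-12-28", "high": 12, "low": 10, "open": 10, "close": 11, "volume": 120},
    {"symbol": "B", "date": "2016-12-27", "high": 52, "low": 49, "open": 51, "close": 50, "volume": 300},
    {"symbol": "B", "date": "2016-12-28", "high": 50, "low": 48, "open": 50, "close": 49, "volume": 280},
]
X, y = make_pairs(rows)
print(X, y)
```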

The algorithms tried were Random Forest Regressor, Extra Trees Regressor, and Gradient Boosting Regressor; the final selection was a model with 2 hidden layers and 1 final output. The final result was 82 points.
