Our current directions
Social media, including popular forums, the blogosphere, and social networks, has evolved tremendously in the past decade, and we have witnessed the proliferation of many successful services and applications: Twitter, Flickr, YouTube, Facebook, Hyves, LiveJournal, etc. The data generated by the very large number of users of these applications allows for many interesting studies that were hard to imagine before. How social media is produced, cited and used; how information propagates and is searched for and found; how social communities evolve; what people talk or argue about; what opinions people hold: these and many other questions are of interest to both academia and business.
Social media mining aims to facilitate traditional and new kinds of search, recommendation, and predictive modeling tasks. In the recent past we studied several topics related to social media mining.
Erik Tromp, in his thesis Multilingual Sentiment Analysis on Social Media, investigated automated sentiment analysis on multilingual data from different social media, including Twitter. We studied a four-step approach to this problem, comprising language identification, part-of-speech tagging, subjectivity detection and polarity detection. For language identification and polarity detection Erik presented new algorithms called LIGA and RBEM, respectively. The experimental study illustrated the benefit of each of the four steps and allowed us to quantify how important it is for the output of the technique used at each step to be as accurate as possible.
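To give a feel for the language identification step, here is a minimal sketch of graph-based character n-gram language identification in general. It is not Erik's LIGA implementation: the class name, the trigram padding, the scoring rule (relative node and edge frequencies summed per language) and the toy training sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the character trigrams of a text, in order of appearance."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramGraphIdentifier:
    """Toy graph-based language identifier (illustrative, not LIGA itself)."""

    def __init__(self):
        # Nodes are trigrams; edges are transitions between consecutive trigrams.
        # Both carry per-language occurrence counts.
        self.node_counts = defaultdict(lambda: defaultdict(int))
        self.edge_counts = defaultdict(lambda: defaultdict(int))
        self.node_totals = defaultdict(int)
        self.edge_totals = defaultdict(int)

    def train(self, text, language):
        """Add one labeled text to the graph."""
        grams = trigrams(text)
        for gram in grams:
            self.node_counts[gram][language] += 1
            self.node_totals[language] += 1
        for edge in zip(grams, grams[1:]):
            self.edge_counts[edge][language] += 1
            self.edge_totals[language] += 1

    def scores(self, text):
        """Sum the relative frequencies of matched nodes and edges per language."""
        grams = trigrams(text)
        result = defaultdict(float)
        for gram in grams:
            for language, count in self.node_counts.get(gram, {}).items():
                result[language] += count / self.node_totals[language]
        for edge in zip(grams, grams[1:]):
            for language, count in self.edge_counts.get(edge, {}).items():
                result[language] += count / self.edge_totals[language]
        return dict(result)

    def identify(self, text):
        """Return the best-scoring language, or None if nothing matches."""
        scores = self.scores(text)
        return max(scores, key=scores.get) if scores else None


def build_demo_identifier():
    """Train on a few hypothetical sentences (not the project datasets)."""
    identifier = TrigramGraphIdentifier()
    identifier.train("the cat is on the table", "en")
    identifier.train("this is a nice day", "en")
    identifier.train("de kat zit op de tafel", "nl")
    identifier.train("dit is een mooie dag", "nl")
    return identifier
```

Calling `build_demo_identifier().identify("the day is nice")` picks the language whose trigrams and trigram transitions overlap most with the input. Using transitions in addition to single trigrams is what makes the model a graph and not just a bag of n-grams.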
Murat Ongun, in his thesis Utilizing Social Media Data for Search Engine Marketing, studied how to align streaming data from social media with web analytics data and how to facilitate its mining for different search engine optimization tasks, including additional keyword generation, finding patterns related to geographical regions, and trend detection for managing keyword bids.
Samuel Louvan, in his thesis Web Page Segmentation & Structure Analysis for Eliminating Nonrelevant Content, studied how to identify relevant content in social media websites, including blogs and forums. Most previous approaches used heuristic rule sets to locate the main content. Our main contribution in this work is a web content extraction module that uses a hybrid approach, combining machine learning with heuristic approaches developed by Samuel, namely Largest Block String, String Length Smoothing, and Table Pattern. According to our experiments, the combination of machine learning and heuristics gives encouraging results and is competitive with current state-of-the-art web content extraction methods.
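As a rough illustration of what a content extraction heuristic can look like, the sketch below picks the block element that directly holds the most text. This is a generic largest-text-block heuristic written for this page; it may differ from Samuel's Largest Block String, and the tag sets, class name and function name are assumptions.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "td"}
# Text inside these tags is never treated as content.
SKIP_TAGS = {"script", "style"}


class BlockTextCollector(HTMLParser):
    """Collect the text found inside each block element of a page."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one list of text fragments per block
        self._open = []        # indices of the currently open blocks
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.blocks.append([])
            self._open.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth and self._open:
            # Credit the text to the innermost open block only.
            self.blocks[self._open[-1]].append(text)


def largest_text_block(html):
    """Return the text of the block with the most characters, or ''."""
    collector = BlockTextCollector()
    collector.feed(html)
    texts = [" ".join(fragments) for fragments in collector.blocks]
    return max(texts, key=len, default="")
```

On a typical blog page, navigation and footer blocks hold little text, so the longest block is often the post body. A heuristic this simple fails on pages that split the main content across many sibling blocks, which is one reason to combine heuristics with a learned model.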
Collaborators
- Dept. Computer Science, TU/e
- Mykola Pechenizkiy
- Erik Tromp
- Murat Ongun
- Samuel Louvan
- Multiscope
- Renzo de Hoogen
- Adversitement
- Guido Budziak
- Bob Nieme
- Teezir
- Arthur van Bunningen
Publications
- Erik Tromp and Mykola Pechenizkiy. Graph-Based N-gram Language Identification on Short Texts. Benelearn 2011.
Code & Datasets
We are working on making the software, source code and datasets created and used in this project available to the research community (as long as there are no NDA, IP, ethical or proprietary concerns). Currently, the following datasets are available:
- LIGA_Benelearn11_dataset.zip (description.txt) Preprocessed labeled Twitter data in six languages, used in Tromp & Pechenizkiy, Benelearn 2011
- SA_Datasets_Thesis.zip (description.txt) All preprocessed datasets as used in Tromp 2011, MSc Thesis




