from 17:00 to 17:45
Knowledge about the behaviour and preferences of web portal users is highly desirable nowadays. Profiling technologies can be applied across a variety of domains, such as personalised advertising, user analysis and statistics, customer relationship management and content customisation. Agora S.A., one of the biggest Polish media companies, is working on user profiling technologies. There are many well-known and lesser-known techniques for profiling users offline. However, the Big Data Department of Agora is designing and developing a system for real-time user profiling, which raises a number of technological challenges and research problems.
Agora owns dozens of themed, classified, entertainment and social services. These include news and sports portals, forums, advertising services, blogs and many other thematic websites. Together, the sites generate over 400 page views per second (under normal conditions) and considerably more events (such as focus, click and scroll events). This raises one question: how can user profiles be built in real time in such a dynamic and changing environment?
In our presentation we will show that it is possible to build a system that is efficient and easy to develop and maintain. As mentioned above, we have data on user page views and related events. A single page view contains additional information such as geolocation, the parsed User Agent and the referrer URL. Is it therefore possible to create a system in which a data steward can define rules such as: 'if a user from a specific city visits a given website and reads an article (up to a certain point) containing a certain tag, then increase segment 'Reader' by 1 and set segment 'City' to the city name'? Can data from different data streams be unified into a uniform structure? Can user features contain different types of data (e.g., booleans, numbers, strings)? Can a user profile be composed of data from a few days back? Is it possible to share profiles that are updated in real time quickly? Is there a way to modify the configuration of the profiling system without rebooting it? Can we easily solve the issue of data permissions for multiple clients of the profiling service?
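The kind of rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the design of the Agora system: the `PageView` fields, the `Rule` structure, the `apply_rules` helper, the site name `news.example`, the tag `sport` and the 0.5 reading threshold are all hypothetical. It also shows a profile holding features of different types (a number and a string).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class PageView:
    """Hypothetical unified event; field names are assumptions."""
    user_id: str
    site: str
    city: str
    tags: List[str]
    read_fraction: float  # share of the article read, from 0.0 to 1.0


@dataclass
class Rule:
    """A condition on an event plus an update applied to the profile."""
    condition: Callable[[PageView], bool]
    update: Callable[[Dict[str, Any], PageView], None]


def apply_rules(profile: Dict[str, Any], event: PageView,
                rules: List[Rule]) -> Dict[str, Any]:
    """Apply every rule whose condition matches the event."""
    for rule in rules:
        if rule.condition(event):
            rule.update(profile, event)
    return profile


def reader_update(profile: Dict[str, Any], event: PageView) -> None:
    # Numeric segment: incremented by 1 on every match.
    profile["Reader"] = profile.get("Reader", 0) + 1
    # String segment: set to the city taken from the event.
    profile["City"] = event.city


# Example rule; city, site, tag and threshold are illustrative values.
reader_rule = Rule(
    condition=lambda e: (e.city == "Warszawa"
                         and e.site == "news.example"
                         and "sport" in e.tags
                         and e.read_fraction >= 0.5),
    update=reader_update,
)

matching = PageView("u1", "news.example", "Warszawa", ["sport"], 0.8)
profile = apply_rules({}, matching, [reader_rule])
print(profile)  # {'Reader': 1, 'City': 'Warszawa'}
```

Keeping rules as data (a condition paired with an update) is one way such definitions could be changed at run time without restarting the processing job, although the talk abstract does not say how this is actually achieved.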
There is one answer to these questions: Yes! We will show that, by using the fast, large-scale data processing engine Spark with its Streaming library, it is possible to build an efficient and production-ready system for user profiling. We will demonstrate that the distributed and scalable HBase database can be used effectively to support such a dynamic environment. Our system also relies on other technologies, such as the Kafka messaging system, the Redis cache, MySQL and the Spring framework. We encountered many problems and technical obstacles throughout the project, and we will be happy to share our recipes for solving them. Moreover, we will try to show how such a user profiling system can be enriched with machine learning technologies.
There are many interesting methods of solving the problem of user profiling. Furthermore, there are many big data technologies for real-time processing. We want to show how to combine these methods and technologies to build a production system: the system that is the core of Agora's Big Data.