Social Media: Monitoring and Analysis System

CHALLENGE

The System is an automatically composed news feed: it monitors user activity and generates a
customized news feed based on each user’s interests.

The System continuously collects information from more than 50 000 sites in the English-language part
of the Web and processes an average of 80 000 new articles per day. It also updates data on more
than 1 million existing articles.

The client requested an expert review from Byndyusoft after one of the leading custom software
developers in Eastern Europe failed to design and implement a scalable solution with
sufficient performance.

Problems of version 1.0:

The System was a set of isolated projects, unconnected with each other.

Data delivery required manual adjustment of each project for every individual site, which made it
impossible to increase the delivery of new articles.

Data processing did not provide the required integrity: up to 50% of articles
disappeared, and processing errors were not properly logged.

Because of the monolithic architecture, cheap horizontal scaling was impossible. All
subsystems could run only on a single server in one thread.

Data processing lagged behind the input of new data, and no optimization had been performed.

The content shown to users was irrelevant, and the algorithms had major errors.

Causes of the problems:

The project had no integral architecture. Every service was developed separately. The services could
interact with each other only through the shared DB.

Use of the shared DB resulted in constant locking, which caused all parts of the
system to hang.

The system depended on external services that allowed only a limited number of API requests and did
not ensure the necessary data integrity.

Many services were untested and had critical bugs and memory leaks.

Work on version 1.0 was performed by a team of 4–6 highly paid IT specialists over about half a year.

After Byndyusoft took over the development, the tasks for the first six months were stated as follows:

Ensure continuous conveyor delivery of data.

Make the data delivery services and front-end sites scalable.

Ensure the integrity of downloaded data, which required increasing
delivery by a factor of 8–10.

Improve the quality of data processing and add missing services. The main services included removing
advertisements from text, analysis of similar images, and analysis of similar texts, among others.

Completely rewrite the user interface and optimize the speed of the endless feed for all modern
browsers and mobile devices.

Ensure integration with social networks, particularly Facebook, including the release of an
application for Facebook.

Provide fast and relevant output of content to users.

Ensure fault tolerance and replication of the whole system.

The conclusions that the Byndyusoft team drew from the previous team's mistakes in the design of the
system, together with extensive experience in developing high-load systems and professional use
of agile development techniques, made it possible to create a new version of the System that met
all of the client’s requirements in the shortest possible time.

SOLUTION

Within 6 months, a team of 6 people developed a working project that met all the assigned objectives.

Conveyor for processing of articles and images

The Byndyusoft team designed and built a conveyor that covers the full article processing cycle:

Downloading and parsing the article list from the source (RSS feed or a custom API for the source)
and obtaining links to articles.

Resolving all redirects to obtain the final link to an article.

Downloading the HTML page with the article.

Locating the article body.

Tagging the article: text analysis and highlighting of keywords.

Removing advertisements and «garbage» sections from the article.

Locating and downloading images and videos related to the article.

Checking whether the text of the article exactly matches, or is similar to, texts that are already
available, so that there are no duplicates in the user feed.

Checking whether the images exactly match, or are similar to, images that are already
available, so that there are no duplicates in the user feed.

Checking in a browser whether the original article can be displayed in an iframe.

Collecting the social characteristics of the article: the number of likes and reposts
of the article on Facebook, the number of comments on the article, and the number of tweets
with a link to the article.
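
The text duplicate check in the steps above can be illustrated with a minimal sketch. The case study
does not say which algorithm was used, so everything here is an assumption: an exact match is detected
by hashing normalized text, and similarity is estimated with the Jaccard index over three-word
shingles. The function names and the 0.6 threshold are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-word characters into one space."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text; equal fingerprints mean an exact match."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences taken from the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles common to both sets (0.0 = disjoint, 1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(new_text: str, known_texts: list, threshold: float = 0.6) -> str:
    """Return 'exact', 'similar' or 'new' for new_text against known_texts.

    The threshold is an illustrative value, not one taken from the project.
    """
    new_fp = fingerprint(new_text)
    new_sh = shingles(new_text)
    result = "new"
    for known in known_texts:
        if fingerprint(known) == new_fp:
            return "exact"
        if jaccard(new_sh, shingles(known)) >= threshold:
            result = "similar"
    return result
```

Comparing every new article with every stored one does not scale to a million articles, so a
production system would likely index fingerprints and use a sublinear similarity scheme; the sketch
only shows the distinction between an exact match and a similar text.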

Architecture and horizontal scalability

The System is built as a scalable set of services that interact through a common data bus (RabbitMQ).
Database reads and writes are carefully divided among the services to prevent lock contention. All
information needed for content output to users is copied to a MongoDB cluster to increase the
speed of data output.

The mechanism for data output to the UI was designed and developed from scratch. The main focus of
the design was fast and relevant output of content to users. All front-end sites synchronize their
caches with each other and interact with the delivery services through the common bus.
As a result, a new article may appear in the output to users before it is physically
saved in the DB.
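
A minimal sketch of why an article can reach users before it is saved: when the front-end cache and
the database writer both subscribe to the same bus topic, the cache is filled as soon as the message
arrives, independently of when the write happens. The in-memory bus below is only a toy stand-in for
RabbitMQ, and the class names, the topic name, and the deferred batch write are assumptions made for
illustration.

```python
from collections import defaultdict


class InMemoryBus:
    """Toy stand-in for a message bus: every subscriber gets every message on a topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)


class FrontEndCache:
    """Keeps articles ready for output without waiting for the database."""

    def __init__(self):
        self.articles = {}

    def on_article(self, article):
        self.articles[article["id"]] = article


class DatabaseWriter:
    """Queues incoming articles and persists them later (assumed batch write)."""

    def __init__(self):
        self.pending = []
        self.saved = {}

    def on_article(self, article):
        self.pending.append(article)

    def flush(self):
        for article in self.pending:
            self.saved[article["id"]] = article
        self.pending.clear()


bus = InMemoryBus()
cache = FrontEndCache()
writer = DatabaseWriter()
bus.subscribe("article.processed", cache.on_article)
bus.subscribe("article.processed", writer.on_article)

bus.publish("article.processed", {"id": 1, "title": "Example article"})
# The cache can already serve the article although nothing has been persisted yet.
visible_before_save = 1 in cache.articles and 1 not in writer.saved
writer.flush()
```

In a real deployment the handlers would run in separate processes and consume from their own queues,
so the ordering between cache update and database write would depend on each consumer's speed.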

A powerful tool for data analysis and for tracking the current operating state of the entire
project infrastructure was built on Zabbix and in-house projects.

Android client

The API that the web servers provide for the site UI made it possible to create an Android
application without any changes on the server side. As a result, the main and mobile versions of
the site UI and the Android application all use the same API.

RESULTS

The work delivered by the Byndyusoft team gave the client the following advantages:

Byndyusoft delivered a working solution that fully satisfies the client. The first working versions
were presented just one month after the start of development, whereas the previous team had not
delivered any working version in 6 months of work on the project.

Byndyusoft met all requirements for the quality and speed of data processing within 6 months and,
at the same time, ensured a performance margin that later allowed the initial set of data sources
to be extended.

Building in fault tolerance and testing different failure scenarios made it possible to avoid
interruptions in system operation during real equipment failures and emergencies.

The duration of daily consultations with the client's representatives fell from 2 hours in the
first days of the project to 15 minutes by the second month of development.