Category: Summer of Code

We had an in-person meeting at the MTG during the MetaBrainz summit to discuss the status and future of AcousticBrainz. We came up with a rough outline of things that we want to work on over the next year or so. This is a small list of tasks that we think will have a good impact on the image of AcousticBrainz and encourage people to use our data more.

State of AcousticBrainz

AcousticBrainz has a huge database of submissions (over 10 million now, thanks everyone!), but we are currently not using the wealth of data to our advantage. For the last year we’ve not had a core developer from MetaBrainz or MTG working on existing or new features in AcousticBrainz. However, we now have:

Philip, who is starting a PhD at MTG, focused on some of the algorithms/data going into AcousticBrainz

Alastair, who now has more time to put towards management of the project

Because of this, we’re glad to present an outline of our next tasks for AcousticBrainz:

Short-term

Some small tasks that are quick to finish and we can use to show off uses of the data in AcousticBrainz

Merge Philip’s similarity, including an API endpoint

Philip’s masters thesis project from last year uses PostgreSQL search to find acoustically similar recordings to a target recording. This uses the features in AcousticBrainz. We need to ensure that PostgreSQL can handle the scale of data that we have.

An extension of this work is to use the similarity to allow us to remove bad duplicate submissions (we can take all recordings with the same MBID and see if they are similar to each other, if one is not similar we can assume that it’s not actually the same as the other duplicates, and mark it as bad). We want to make these results available via an API too, so that others can check this information as well.

Merge Existing PRs

We have many great PRs from various people which Alastair didn’t merge over the last year. We’re going to spend some time getting these patches merged to show that we’re open to contributions!

Publish our Existing models

In research at MTG we’ve come up with a few more detailed genre models based on tag/genre data that we’ve collected from a number of sources. We believe that these models can be more useful that the current genre models that we have. The AcousticBrainz infrastructure supports adding new models easily, so we should spend some time integrating these. There are a few tasks that need to be done to make sure that these work

Ensure that high-level dumps will dump this new data (If we have an existing high-level dump we need to make a new one including the new data)

Ensure that we compute high-level data for all old submissions (we currently don’t have a system to go back and compute high-level data for old submissions with a new model, the high-level extractor has to be improved to support this)

Update/fix some pages

We have a number of issues reported about unclear text on some pages and grammar that we can improve. Especially important are

Front page (Show off what we have in the project in more detail, instead of just a wall of text)

Data page (instead of just showing tables of data, try and work out a better way of presenting the information that we have)

Fix Picard plugin

When AB was down during our migration we were serving HTML from our API pages, which caused Picard to crash if the AB plugin was enabled while trying to get AB data. This should be an easy fix in the Picard plugin.

High Impact

These are tasks that we want to complete first, that we know will have a high impact on the quality of the data that we produce.

Frame-level data

We want to extract and store more detailed information about our recordings. This relies on working being done in MTG to develop a new extractor to allow us to get more detailed information. It will also give us other improvements to data that we have in AB that we know is bad. This data is much bigger than our current data when stored in JSON (hundreds of times larger), so we need to develop a more efficient way of storing submissions. This could involve storing the data in a well-known binary data exchange format. A bunch of subtasks for this project:

Finish the essentia extractor software

Decide on how to store items on the server (file format, store on disk instead of database)

Work out a way to deal with features from two versions of the extractor (do we keep accepting old data? What happens if someone requests data for a recording for which we have the old extractor data but not the new one?)

Upgrade clients to support this (Change to HTTPS, change to the new API URL structure, ensure that clients check before submission if they’re the latest version, work out how to compress data or perform a duplicate check before submission)

Deduplication (If we have much larger data files, don’t bother storing 200 copies for a single Beatles song if we find that we already have 5-10 submissions that are all the same)

MusicBrainz Metadata

Rashi’s GSoC project in 2018 helped us to replicate parts of the MusicBrainz database into AcousticBrainz. This allows us to do amazing things like keep up-to-date information about MBID redirects, and do search/browse/filtering of data based on relationships such as Artists just by making a simple database query. We want to merge this work and start using it.

Dumps

When we changed the database architecture of AcousticBrainz in 2015 we stopped making data dumps, making people rely on using the API to retrieve data. This is not scalable, and many people have asked for this data. We want to fix all of the outstanding issues that we’ve found in the current dumps system and start producing periodic dumps for people to download.

Build more models

In addition to the existing models that we’ve already built (see above, “Publish our Existing models”), we have been collecting a lot of metadata that we could use to make even more high-level models which we think will have a value in the community. Build these models and publicly release them, using our current machine learning framework.

Wishlist

These are tasks that we want to complete that will show off the data that we have in AcousticBrainz and allow us to do more things with the data, but should come after the high-impact tasks.

Expose AB data on MusicBrainz

As part of the process to cross-pollinate the brainz’s, we want to be able to show a small subset of AB data that we trust on the MB website. This could include information such as BPM, Key, and results from some of our high-level models.

Improve music playback

On the detail page for recordings we currently have a simple YouTube player which tries to find a recording by doing text search. We want to improve the reliability and functionality of this player to include other playback services and take advantage of metadata that we already have in the MusicBrainz database.

Scikit-learn models

The future of machine learning is moving towards deep learning, and our current high-level infrastructure written in the custom Gaia project by MTG is preventing us from integrating improved machine learning algorithms to the data that we have. We would like to rewrite the training/evaluation process using scikit-learn, which is a well known Python library for general machine learning tasks. This will make it easier for us to take advantage of improvements in machine learning, and also make our environment more approachable to people outside the MusicBrainz community.

Dataset editor improvements

Part of the high-level/machine learning process involves making datasets that can be used to train models. We have a basic tool for building datasets, however it is difficult to use for making large datasets. We should look into ways of making this tool more useful for people who want to contribute datasets to AcousticBrainz.

Search

With the integration of the MusicBrainz database into AcousticBrainz, we will be able to let people search for metadata related to items which we know only exist in AcousticBrainz. We think that this is a good way for people to explore the data, and also for people to make new datasets (see above). We also want to provide a way that lets people search for feature data in the database (e.g. “all recordings in the key of Am, between 100 and 110BPM”).

API updates

As part of the 2018 MetaBrainz summit we decided to unify the structure of the APIs, including root path and versioning. We should make AcousticBrainz follow this common plan, while also supporting clients who still access the current API.

We should become more in-line with the MetaBrainz policy of API access, including user-agent reporting, rate limiting, and API key use.

Request specific data

Many services who use the API only need a very small bit of information from a specific recording, and so it’s often not efficient to return the entire low-level or high-level JSON document. It would be nice for clients to be able to request a specific field(s) for a recording. This ties in with the “Expose AcousticBrainz data on MusicBrainz” task above.

Everything else

Fix all our bugs and make AcousticBrainz an amazing open tool for MIR research.

Thanks for reading! If you have any ideas or requests for us to work on next please leave a comment here or on the forums.

Hello,
I am Chhavi. I have mostly been helping around with all things design in MetaBrainz. I recently graduated from IIT Guwahati, India and started contributing to MusicBrainz after attending the summit last year, around the same time.

As a Google Summer of Code student, my project was to build a design system with React UI components for the upcoming overhaul of MusicBrainz’s website. It surely was a really interesting journey, right from when I heard about the community and I would like to share some snippets of it with you!

May 2017: I hear about Picard, and how a bunch of really cool people who meet online are building it. I was intrigued.

Around August 2017: I pop in the IRC channel #metabrainz, and after much overthinking, I drop a “Hi”. Followed was a really warm welcome by people I will soon call friends and a lot of developer-y jargon I had no clue about.

September 2017: I attend the annual MusicBrainz developer summit in Barcelona. And boy oh boy, I am now part of the family. Over the few days there, I have immense fun interacting and learning from the community.

November 2017: We set up our JIRA ticket system for design issues and start working on the mockups for the redesign. The entire community comes together on JIRA tickets and Discourse posts to talk about where we want to go with this overhaul.

January 2018: Community members encourage me to try my hand at front-end development. One is really lucky to find people, who encourage you to grow out of your comfort zone and help you cross that wall. In MetaBrainz, there is no shortage of such kind of people.

March 2018: With little confidence and lots of hopes, I apply for the Google Summer of Code programme. I start learning the ropes of development, with help of online tutorials and obviously our community. We also met for a mini-summit in Delhi to discuss ListenBrainz and spicy food.

April 2018: Hence began my full-fledged journey of learning and spending a summer of coding. It wasn’t easy, but I learned a lot in the process.

We set up the initial design system using react-bootstrap and react-storybook. I then started importing UI components into the system, followed by its documentation. I wrote up a more detailed description of the process too.

August 2018: As of now, we have the design system in place. The future plan is to continue adding components to it as well focus on having well thought contributing guidelines. I will also continue working on designing the mockups for the user interface for various entities.

Google Summer of Code was just another milestone in my journey with MetaBrainz. My time here has been a time of both personal and professional growth. I now feel more comfortable in a development environment, the ongoing chats on IRC make more sense to me and I feel less inhibited to put my thoughts out there. I completed my college, moved cities, traveled… all while having a set of these amazing people I call family.

A special shout out to Rob for keeping me going, bitmap for being ever so patient and understanding, samj1912 for introducing me to MetaBrainz, CatQuest, iliekcomputers, Suyash, Freso, reo and zas for being amazing friends through it all.

The thing I like about our community is, we had seasoned developers as well as newbies like me, all together working together to create amazing stuff. Hoping to continue being an involved and colorful part of this community,

I’m Shivam Tripathi, an undergraduate student from the Indian Institute of Information Technology, Una. I interned for the MetaBrainz foundation under the Google Summer of Code programme for the year 2018 and worked on the BookBrainz project. I was mentored by Ben Ockmore during this period. This post summarizes my contributions to the project and experiences that I had throughout the summer trying to solve various problems related to the implementation of the project.

Proposal

The original proposal I submitted to Google underwent some modifications as the project progressed, details of which can be found later in this post.

Community Bonding

Summer of Code started with the community bonding period – during which I attended the regular Monday meetings at MetaBrainz’s IRC channel #metabrainz and interacted with the MB community members. I added multiple new entities to the BookBrainz’s website and helped some users with BookBrainz related queries on the community page (intended for support/general QA related to all of MetaBrainz projects).

Also during this period, my mentor Ben Ockmore and I discussed and finalized the architecture of the importer application. It was decided to split the entire importer into two microservices: one for producer (which reads the data dumps and produces generic objects for each record using BookBrainz data storage format) and the other of consumer (which reads and validates the generic object and then insert them into the BB database). It was decided to connect these microservices using a message broker queue (RabbitMQ was finalized). In addition to this, the code repository architecture was decided to be such that we should be able abstract away the entire message broker logic, so that later it would be possible to swap out RabbitMQ with any other hosted service later (like pubsub).

Fig. 1: Initial design of the intended importer application. For more information, visit the original document.

Coding period

First phase

The program coding period kicked off with making changes to the existing BookBrainz schema to enable it support our new imports. The initial design as discussed here was later updated to include views as well for imports per entity to enable simpler queries.

Following this, I started working on the bookbrainz-data repository to add some basic functions for aiding the import process. I started work in accordance with one of the existing roadmaps for the BookBrainz project which was to shift all database logic from bookbrainz-site to bookbrainz-data – adding features on a per-function basis. Initially it was decided to use Immutable.js for all data flowing in and out of bookbrainz-data-js, but very soon we realized that it was not practical to follow this approach. After some discussions, we finally settled on this repository design change to incorporate new function-oriented functionality. We named this new sub-module func.

Once I had basic functions to handle database transactions in place, I started working on the importer architecture. It was decided to create multiple instances of the producer process each with the ability to run asynchronous operations on it’s own. Similarly, we should be able to fork multiple consumer process, each capable of fetching data from queue and sending it off to the database.

To address this problem, I started working on a module which given a function would make it possible to run multiple processes running multiple instances of the given function. It should be such that we can generate the arguments dynamically for each process and along with some set-up and tear-down actions before and after we fork the process.

To get a better grasp of underlying functionality, one can read the final API and documentation. It’s a generic module which can be used for any functionality. The diagram for it’s execution flow can be found below:

Fig 2. AsyncCluster module execution flow. For more details, see the complete documentation.

Second phase

While developing producers, I first designed the generic producer object structure for all entity types – an object skeleton which all producers need to create from the read records to be pushed into the queue. This object structure was to be enforced across all data sources, and this helps the consumers to expect an object of fixed nature on which they can later run automated validation tests prior to adding to the database.

As the data dumps were of considerable size, I used data streams to read the data from the flat files and parsed them to create generic entity objects which used BookBrainz data storage format. After parsing each record, I pushed the records into the queue.

Parsing required thorough analysis of the data dumps, and manually mapping each key-value pair in the data dump to the generic object structure. All the data which did not fit the present BB schema (and hence had to be excluded from the generic producer object structure) was added to a metadata field associated with the import. This metadata field is stored as a bjson in the database, so that we can individually query and index any of the fields in the metadata later.

While developing the consumers, I initially set up a validation module. Much of it was adapted from the existing validators on BookBrainz site, which I was able to use without much alteration thanks to the generic producer object structure. The validation modules in the bookbrainz-site have been written to quickly validate the form submitted by the editor post creation/editing of an entity. To use them in the import process, I wrote a converter which transforms the generic producer object to form sections understood by the validation module. Apart from this, I added better error handling to ensure all errors are caught and reported in case a something goes wrong.

Error handling was another important aspect of the import process, apart from the validation process. Being a command line application, tracking errors was central to ensuring that all components were running as intended. At the same time we had to ensure that no record which could be potentially imported into the database was missed. To address these problems, I decided to discard the record if it fails the data integrity validation tests (which means the data is most probably corrupt). But in the case of any transaction error, we give a fixed number of chances to the record before discarding it (by acknowledging the message). A future goal for this process could be to push the erred record into another queue for analysis and replaying of those messages from this queue back to the original queue when the problem gets sorted.

Once the importer was in place, I focused on building up the func.imports module with more functions for the import entities – like discard and approve. I also added functions to fetch recent imports, and a lot other helper code for the imports. I also ensured that all errors occur loudly and never silently slip away. With the help of my mentor, I also migrated most of the functions required for data transactions on the bookbrainz-site. This was crucial to my project – as in many instances the existing functions could not be used due to them initiating their own database transaction for each action. I split all these actions into functions, and bound them with the transaction object they received rather than initiating their own transaction. I also ensured we use modern ES6 features – which made the adapted code much more sleek and compact. It was a long process, as I had to read almost entire of existing code for data transactions on the bookbrainz-site and adapt each of them correctly. All the code finally came together in the create-entity module – which can now be used for entity creation as well as upgrading the imports to entities.

Third Phase

The work on bookbrainz-site and bookbrainz-data mostly happened side by side. First I added a recent imports page – which would fetch most recent imports from the database and display them inside the React component. The recent imports is designed as a single page application which dynamically loads the paginated records and renders them on-screen. The working of the recent page application is as follows:

Fig. 3: Recent imports execution flow

Next, I added import-entity display pages for all five entity types. They were supposed to display the entity attributes along with links to approve/discard/edit and approve functionality. Approving the import-entity was done so that the user gets redirected to the newly created entity. The import-entity display page for work is as follows:

In case of a discarded import, I added an extra page similar to existing confirm deletion page – which asks the user to confirm the action and then waits until the entity is deleted before redirecting the user to the home page. The discard page looks as follows:

Fig. 5: Discard Import Entity Page

Next, I implemented the editing imports prior to approval. For this, I wrote two modules – one to transform the import to the structure used by the editing form and one to convert form data to an entity. When a user wishes to edit an import, the import is transformed to the form and rendered on the screen. The user can then edit the import. When the user clicks submit, we transform the form data to a new entity type and use the create-entity function to create a new entity in the BookBrainz database. The user is then redirected to the newly created entity page. The code for rendering the form and editing the entity was completely reused for imports with minimal changes. I then added functionality to add imports to the ElasticSearch index, and display them in present results. The final search page is as follows:

Fig. 6: Search showing Import Entities

Links to the work done

Conclusion

Last three months have been a fantastic experience for me. Not only did I get to learn a lot of new technologies and write some exciting software, but also I got to brush up my existing skills and interact with the completely awesome MetaBrainz community. Such an opportunity comes truly once in a lifetime, and I extend my sincerest gratitude to Google for running such a great and extremely inclusive programme which allows students from all over the world to avail such an opportunity. Special thanks to my mentor Ben Ockmore for always being patient and helping me out whenever I felt stuck.

Thank you MetaBrainz community for your continuous guidance and support!

Hi, I’m Leo and I spent my summer building and training SpamBrainz, our new solution to fighting spam in MusicBrainz. If you haven’t heard of SpamBrainz before it’s probably because it did not exist before this year’s Summer of Code.

For quite a while now the amount of spam in MusicBrainz has started to become a serious problem. Often this means editors are automatically created with descriptions that look not unlike the spam emails most of us get every day, promoting other websites and services.

During last year’s MetaBrainz Summit we discussed possible solutions to this and came up with the Spam Ninja system. Essentially this means that Soon™ there will be a group of editors that receive spam reports and have the ability to delete editors and entities that are nothing but spam.

Now with MusicBrainz having almost two million registered editors, could we really expect the Spam Ninjas to manually check every single one of them in addition to all the new registrations? Obviously not, and this is where SpamBrainz comes in.

SpamBrainz is a machine learning system that looks at all editors and decides whether or not it thinks they are spammers. If it thinks they are, it automatically notifies the spam ninjas who then decide whether or not SpamBrainz was correct.

What’s great about this system is that a human is guaranteed to look at any report and at no point does a computer decide that you’re a spammer and should be banned, because no one wants machines to run the world, right?

Building SpamBrainz

While most GSoC projects involve adding features to existing systems, SpamBrainz is something entirely new and I had not built anything on this scale before so I started out by doing tons of research.

As I was not working for MetaBrainz at the time and had to respect our privacy policy, I wrote a script to collect the most common values of a couple different editors, anonymize them and save them to a report. Using that data I could compare all spam and non-spam editors and decide upon a set of datapoints that would be useful for my machine learning model. Yvanzo then ran these on the live database and I could happily do my data analysis without compromising user privacy.

Next I built a pretty boring Flask-based API that would allow MusicBrainz to queue up editor analysis and training. Quite a few different MetaBrainz projects use Python and need to access the MusicBrainz database so a long time ago someone wise decided to move commonly used code into a repository called brainzutils-python. All I had to do was to add some code for accessing editor data through it.

In a surprise move by ruaok I was then hired by MetaBrainz as a contractor with a yearly salary of 100g of chocolate. I probably should have negotiated what kind of chocolate but what mattered most was that I could now work with user data without breaching our privacy policy.

But before I could build my Keras model I had to decide on a final set of input features and do write code for preprocessing the data. Only then could I finally get started building and testing models.

The current SpamBrainz state of the art model is Lodbrok which actually turned out to work really well, reaching a 99% accuracy in detecting spam while only mis‐classifying 0.2% of real users as spammers. Obviously the latter won’t be a problem because after all a Spam Ninja will still check these reports.

Future outlook

Now that GSoC is over I could just disappear with all the money and leave SpamBrainz in its current state but obviously that’s not what I am planning to do.

I would like to work with zas on getting it deployed along with the Spam Ninja system, improve the code documentation and try to tackle the remaining problem that is online learning (which as it turns out, isn’t as easy as I had thought).

With spam always evolving and spammers already moving to more sophisticated methods than just using editor biographies, I’d also look into building separate models for other entities.

After all SpamBrainz is just getting started and I’m very much looking forward to continuing our journey towards reducing the spam we all have to endure on MusicBrainz and other MetaBrainz projects.

Here comes an end to a fantastic summer for this year and time to wrap up my GSoC project which I have been working in for the last 3 months (the official GSoC coding period).

Hello people!!

I am Rashi Sah, an undergraduate student at the National Institute of Technology, Hamirpur, India. I have been working on a really cool AcousticBrainz project for MetaBrainz Foundation Inc. as a participant in Google Summer of Code ‘18. It has been an amazing experience and I’ve learned a lot over the summer, spending countless days and nights to successfully take the project to the stage of completion. I decided to contribute to MetaBrainz in late December, then spent some time understanding the codebase of the project and then began creating pull requests and pushing commits for many features, tasks and fixing bugs since January 2018. This blog post consists of my GSoC experience as a student and the work I’ve done for the program so far.

Before starting the GSoC program, I started looking for some good-first-bugs initially and found some tickets to work on. Then I talked to the AcousticBrainz community members and started contributing. I created some big PRs mostly for adding new features to AcousticBrainz. I also worked on many bug fixes which are already merged into the AcousticBrainz codebase. New feature additions PRs include AB-21, AB-98 and AB-298. In mid‐February, I started looking for a suitable idea to work on for GSoC program and to create a proposal for the same. As the month of March was approaching, I did a lot of proposal discussion with MetaBrainz community members especially with Alastair, AcousticBrainz project lead who has helped me a lot in reviewing and guiding me to improve my proposal to a better extent. Later April, my proposal for a more detailed integration of AcousticBrainz with MusicBrainz got accepted. In the community bonding period, I mostly tried to continue my work which I was already doing for the past 3–4 months.

Getting entity information from the MusicBrainz database

The first thing I worked on when the official GSoC coding period began was adding a way to directly access MusicBrainz database for different entities to the MusicBrainz database module in BrainzUtils (a Python utility for all of our MetaBrainz projects). I worked on getting artist and release entity information from the MusicBrainz database via a direct connection. (See PRs BU-13 and BU-14.) Later, I worked on setting up the MusicBrainz server by adding a service in AcousticBrainz’s docker-compose files allowing us to easily read data directly from the MusicBrainz database in AcousticBrainz (PR AB-334). Our major aim of the project was to implement both the methods of MusicBrainz database access in AcousticBrainz especially importing the MusicBrainz database in AcousticBrainz from scratch and then to decide which methods works better while implementing a particular functionality in AcousticBrainz using MusicBrainz data.

Import the MusicBrainz data in AcousticBrainz database

MusicBrainz’s database contains a huge number of tables, but I analysed the use case of MB data in AB and made a list of those tables that we would actually require in our AcousticBrainz integrations. Then I made a PR (AB-338) for creating new tables in the AB database under the MusicBrainz schema. Later, I worked on a big PR (AB-340) which imports MB data corresponding to each and every recording present in AcousticBrainz’s database and writes the data into the tables of the MusicBrainz schema in AB. This PR was really huge and I had to take care of a lot of integrity constraints and foreign key dependencies.

Update MB data in AB for every new recording added to AB

Another feature I worked on after importing the MB data was updating the MB data present in AB whenever any new recording is added to the AcousticBrainz database (see PR AB-346) by importing the data from MB’s database via the direct connection. While working on a few bug fixes, I and my mentor, Param realized that the MB data import is taking a lot more time than expected when I applied the MusicBrainz importer script for full MB data dumps (of around 2.8 GB). So, I then worked on making the MusicBrainz importer more efficient and was able to import the data for few recordings within seconds (see PR AB-348). I had to figure out a lot for each table import and to detect the parts of the code which were making things slower.

To reduce the load on the processor, I included a sleep schedule of 5 seconds in the MusicBrainz importer module to wait before importing data for any new recording (see PR AB-354). During my GSoC period, I learned how important it is to write tests and make them run fast. I wrote tests for almost every script inside the db module. Later, I worked on writing tests for the MusicBrainz importer script (AB-352).

Then came another tricky part of this project which was to update the MusicBrainz schema data in AB whenever there is any change in the actual MusicBrainz database whether it is an update or a deletion taking place. MusicBrainz provides hourly replication packets which describe the changes to the database in a specific period. Replication packets are .tar.bz2 archives with a collection of files in them which can be downloaded via the MetaBrainz API. Lukas Lalinsky, a long-time contributor to MetaBrainz projects, the founder of AcoustID and maintainer of the mbdata Python module, had worked on implementing replication packets on MB data. I did a lot of modifications in his script to apply replication packets to the MusicBrainz schema data till it’s recent update for the recordings data present in AcousticBrainz (see AB-350).

Integration with MB database: Use MBID redirect information to get original entity

After working on the direct connection and importing the MusicBrainz data, keeping it updated by all means, it was time to start working on writing evaluation scripts to decide the better method for any integration we apply in AcousticBrainz. I wrote a script to implement an integration in AB with MB database to use the redirect information of an entity and then returns the original entity corresponding to the MBID provided (see PR AB-356).

Evaluate both methods of MusicBrainz database access in AcousticBrainz

Now moving towards the last work of my GSoC period and the most important as well. After working on both the methods, we really needed to evaluate both in order to test which one is more efficient for any specific integration with the MB database. I first wrote an evaluation script which fetches the data from the recording and low-level tables. For this case, the difference between the time taken by both methods comes out to be really large (approx. 70 seconds for around 250+ recordings). So whenever we would have to get the data from local AB tables and MB tables as well, we would go for the import database method as this method turns out to be faster than the other one. Next I tested with the MBID redirect integration part in which I didn’t find much difference between both the methods (PR AB-357). But I ran these tests locally, the tests in production may yield different results.

All in all, it has been an exciting summer. By this time I am familiar with a very good part of the AcousticBrainz codebase. I really look forward to work on adding a lot more integrations with MB data in AcousticBrainz and plan to completely remove AB’s dependency over the web service to use the MusicBrainz database which would be very useful for the users.

These last three months were full of thrill, excitement and much frustration as well. And this doesn’t end here, I’d love to contribute in the future and act as a maintainer for the AcousticBrainz project. I believe people must try to contribute to open source organizations as it helps you learn and gain much experience in a short period of time especially when working for a great platform like Google Summer of Code.

I am really happy working with the awesome MetaBrainz community and the people here are fantastic. I’d love to stay being a part of MetaBrainz in future as well. So in the end a big thanks to my mentor Param Singh, without his help & support throughout the program, wouldn’t have been possible for me to reach the end phase of GSoC, and my organization admin Robert Kaye, AcousticBrainz project lead Alastair Porter and all of the MetaBrainz Foundation community members for choosing me as a GSoC student and thus providing me such a great opportunity and also for being very kind and helpful throughout the program. And I want to thank Google for making this all possible. Hope I get a chance to work with you all again!!

Hi, I’m Kartikeya Sharma, a postgrad student at National Institute of Technology, Hamirpur. I’ve worked on the project MessyBrainz as a student developer for GSoC 2018. Robert Kaye mentored me during this GSoC programme. The goal of my project is associating MBIDs to MSIDs and clustering together the MSIDs which represents the same MBID. The MBIDs represent MusicBrainz Identifier. It is an Universally Unique Identifier that is permanently assigned to each entity in the MusicBrainz database, MSID represents MessyBrainz Identifier which is associated with each unique recording, artist_credit and release in MessyBrainz database. In simple words MSIDs represents unclean metadata whereas MBIDs represent clean metadata.

This blog post summarizes the work that I did in my project, which was divided into three parts.

Processing the data already in MessyBrainz database

The first part involves creating clusters using the MBIDs already present in the MessyBrainz database. This involves creating clusters for recordings, artists, and releases. To implement this part I created the following three PRs #37, #41, and #44.

After that, I began to work on the second part of this which involves creating clusters using the artist MBIDs and release MBIDs and names fetched from MusicBrainz database. I needed to access MusicBrainz database, for that, I first had to work on BrainzUtils to have methods to access MusicBrainz database to fetch artist MBIDs using recording MBIDs and release name and release MBIDs using recording MBIDs. The part to fetch artist MBIDs was done during the community bonding period in PR #14 at BrainzUtils and to fetch releases I created PR #18 at BrainzUtils during GSoC coding period. After that, I created a PR to create clusters using the fetched artist MBIDs #47 and another one to create clusters using releases fetched #49.

I did write around 60 tests which proved to be vital in making sure that the code does what it’s supposed to do.

Processing the data as it is inserted into the MessyBrainz database

Creating clusters for the data inside the database requires a lot of resources. So, it was better to create clusters as recordings are inserted into the database but, even this type of clustering is not efficient. So, to cluster these recordings first these recordings are sent to rabbitMQ server and from that, these are sent to a clustering script which runs in a different container and runs continuously and clusters the incoming recordings. That way it does not slow down the process of submitting recordings to the database. For this I created PR #50.

Create endpoints to access MSIDs and MBIDs

I created two API endpoints in PR #51.One endpoint is to fetch MBIDs and MSID using an MSID. Another endpoint is to fetch MSIDs using an MBID. This way end users can access MBIDs and MSIDs which may be used for calculating different stats.

Apart from that with the help of my mentor, I did setup a VM to test the above code on the MsB datadump. This task had some challenges: first I had to create indexes for various fields to speed up the process of clustering. Without indexing, it would have taken approximately 37 days but after creating indexes on various fields It just took 3 hours. I found out that PostgreSQL does allow to create indexes on functions too which came into use while creating artist_credit clusters for which I created a custom function. Indexes were created in PR #53. When I ran the clustering code on a VM on which the whole MessyBrainz datadump was present I found out that we have fields in recording_json table which are supposed to store MBIDs but were pointing to empty strings. This was not supposed to happen initially as ListenBrainz is the only source of data for MessyBrainz currently. Submissions to MessyBrainz are restricted from users directly and ListenBrainz does validate listens for that. So, those recordings must have been inserted before that validation was present. To solve the problem I created PR #52.

The summer was a great learning experience for me. I started slowly as things were messy at the start. As at the start everything wasn’t crystal clear to me, I wasn’t sure on how exactly to write scripts that manipulate database and did write the scripts in the most trivial way possible. Here I was doing a query for every single MBID to first check if it’s present in the recording_cluster tables and if not then cluster the recording. Which is conceptually correct but not efficient by any means. And this could be done by executing a single query on the recording_json table to fetch only those recording MBIDs that are not present in recording_redirect table as those are unclustered. That way we don’t have to process the recording MBIDs that have been already processed making the process of clustering efficient.

With time I got an understanding of how clusters are created and how to handle anomalies. Such as James Morrison. In the end, the definition of anomaly can be put as an MSID represents an anomaly if it points to different MBIDs in entity_redirect table (entity can be artist_credit, recording, and release).

Work to be done ahead

The project is still in its initial stages and requires a lot of work to be done before moving it into production. We still need to write integration tests for ClusterWriter and API endpoints. After that, we can work on the Additional Ideas that I proposed in my proposal. We need to figure out some way to associate MBIDs to MSIDs for the artists, recordings, and releases where no MBIDs are present. This does not seem like a trivial task with so many anomalies to take care of.

Last three months have been a great experience for me. I would like to thank Robert Kaye, Param Singh, and Alastair Porter who helped me to solve a lot of problems that I encountered during the entire period. Working on their suggestions and reviews I was able to write good quality code which was efficient as well. The work culture at MetaBrainz inspired me a lot. At MetaBrainz we have weekly IRC meetings where we get to know what others are doing at the organization and also get a place to tell what we did in our past week. I would like to thank MetaBrainz and Google for giving me this chance to get involved in open source on such a cool project. The association of MSIDs to MBIDs can be used by ListenBrainz as stats are calculated on MSIDs which can then be mapped onto MBIDs which represents clean metadata. I would like to work on the project further because of the learning opportunities that are present in the project.

Back in 1998 when I started playing with Perl and wrote the CD Index (the pre-cursor to MusicBrainz). I was learning web development and had little understanding of web design. The tools I was using were primitive at the time and the results were cringeworthy and have not withstood the test of time.

Fast forward some 18 years and we’ve arrived at the current MusicBrainz site design — there have been minor facelifts over time and a bigger one once we released NGS back in 2011. But really, the site design hasn’t changed much and we’ve kept gluing features and new bits of data onto the crappy design, leaving us with the current mess of a UX experience we know as the modern MusicBrainz.

Our community has been asking us to improve UX for a long time — we need to:
• Empower our community with better tools for developing, editing, viewing the magnificent data that we have.
• Build a stronger foundation for further development, interaction, and extension of our projects in future
• Make our projects more welcoming to newcomers, by lowering the learning curve as well as keeps the workflow of an advanced editor intact.

Fortunately for us, Chhavi [a design student from IIT, India] has become an active contributor to the MetaBrainz projects. She has been studying our sites and how we work as a team and has volunteered to drive the process to fix the UI and the user experience issues on the MusicBrainz site. She has proposed a part of this work as her Google Summer of Code project.

Our overall goal as a team is to create a design system which will help the designers and developers stay in sync, give a more unified theme to our projects, and make it easier for new contributors to join our projects. This will also make it much easier for our developers to address your requests for features/bug fixes faster in the future.

We are not barging into your online lives and trying to make our sites pretty — instead, we are focusing on the real experiences you have with them. We held long detailed conversations during our last summit in Barcelona, where Chhavi was also present and discussed a lot of concerns that might be running in your head while you read this. As part of this initiative, we have been interviewing a number of key members of our project to understand what we and our users really need from this revamp. We have also kept track of community discussions around this topic. From this we decided that our users fall into three broad categories:

When you look at these pages, please keep in mind that we’re trying to clean up the clutter and to make things simple and clean. Easier to understand for an experienced editor or a new one. The data that we have should be presented in a way that makes sense. The data should present the gaps and holes that it presently has, for people to be able to improve the data gaps. Data should also be our binding link to exploit the full potential of the projects that we have, such as ListenBrainz or CritiqueBrainz.

We are not trying to fluff things up and make them look pretty. Prettiness might come with the simplicity that we are chasing. Having user flows that do not hamper the speed and makes our life easier, is our utmost goal.

That said, we are happy to receive feedback on the upcoming designs as well as the process– if you have any, please post your comments to the appropriate tickets in Jira that we linked above. We’re currently getting some pressing dev tasks out of the way before we start the actual implementation of the redesigned project. Once our team is ready to work on this, we will public more blog posts about how this project will unfold and how it will impact our users.