Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. It's 100% free, no registration required.

I am currently designing a Data Mining Project where I am going to harvest rather large volumes of Twitter data in order to analyse locational data (geocoded tweets) and do some machine learning with it.

What I want to do: I'll have some scripts that run 24/7 on a small Samsung Netbook (<2GHz,1GB RAM, 200GB Disk), limited in frequency only by the query limit of the Twitter API. These scripts will save various sorts of data in a database, which in turn will later be used as a base for analysing data.

I am quite experienced in RDBMS, thus I also know their limitations. I just read about CouchDB and its ability to store JSON in so called documents - this would come in handy because the responses from the Twitter API are in JSON, and some of those strings are quite nested and complex.

On the other hand, I don't really want to miss relational functionality, since I have for example a table user which saves general data about a Twitter account and a table geo which saves place-time-Tuples which always reference a particular user.

For analysis, the content of geo will be used in any possible way - I have not yet thought about geospatial analysis in depth, but there will aggregation, distance calculation, all that sort of stuff. This might be done with CouchDB's reduce-Functions in Javascript, I read? If I used a SQLite DB, I would just stick to Python and do everything there.

I don't really know what is possible in CouchDB since I am really new to that concept. I just saw that it is easy to store JSON and that the structure of the database is not fixed at all, so I could easily introduce new types of data or destroy old ones (DROP COLUMN is not possible in SQLite). Also, since I know Javascript pretty well (actually better than Python), it might be easier for me to do analysis on the data.

What do you think? Is there a striking advantage in using NoSQL for doing that sort of thing or should I stick to what I can do best?

1 Answer
1

I think the answer mainly depends on how much time do you want to spend learning new databases vs how much time to you want to spend learning machine learning.

For example PostgreSQL has lots of GIS stuff built-in which I assume could be very useful for your queries. CouchDB has many useful features with their map/reduce stuff but I find it a bit limited. If you think you're going to add many columns later on based on new algorithms you might want to look into the column-oriented databases like Apache Cassandra.

However, I would recommend splitting the problem into two parts:

Data gathering

Data analysis

The first part you seem to have a good plan for already. I would write a super simple app that would just suck in the data from twitter and put it in a table as a BLOB. Or just a file on disk.

When you have that you can go from the raw data and insert it into any kind of backend. Now you can choose backend based on the current problem (algorithm) you're working on, instead of having one solution to fit all your problems. The key point here is to extract exactly the data you need. That way you shouldn't have to consider the fact that the data is nested inside a document from Twitter as you only pick out the parts you need. If you go that way I think an RDBMS would perform very nice as they can run all kinds of queries.

I have thought about PostgreSQL with PostGIS as well, but wanted an easier solution to port to my Samsung Netbook (copying an sqlite.db file is so much easier than replicating a PG-Database). But what you're saying makes a lot of sense - I can collect the data now using something simple as sqlite or just using flat files, and then still import all the stuff into a particular backend such as PostgreSQL, in order to do fancy GIS analysis and so on! Thanks.
–
wnstnsmthDec 21 '12 at 9:00

pg_dump is fairly easy to use, but I'd stick to the tools I know if you want to actually work on/complete the mining parts.
–
trygvisDec 21 '12 at 9:50