<h2>"Look out honey, 'cause I'm using technology..."</h2>
<h3>The Musical Gardener's Tools #5: Yet Another Way to Harvest mp3blogs</h3>
<p><em>Sat, 23 Feb 2008, by Eric Casteleijn</em></p>
<p>Update 2008-03-11: There were a number of things wrong with this script making the spidering *waaaay* slower than it needs to be. Fixed that below, and added threading for both the spidering and downloading, thanks to <a href="http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/496799">this cool recipe by Wim Schut</a> which lets me run all the sqlite code in a separate thread. (Important because you can only use sqlite connections in the thread in which they were created.) All of this results in a nice speed-up.</p>
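<p>The core idea of that recipe, as I understand it, can be sketched roughly like this (my own minimal reduction, not Wim Schut's code or spider.py's; the table and names are invented for illustration): one thread owns the sqlite connection, and all other threads talk to it through queues, so the connection is only ever used in the thread that created it.</p>

```python
import queue
import sqlite3
import threading

def sqlite_worker(db_path, requests):
    """Own the sqlite connection in a single dedicated thread.

    Other threads never touch the connection directly; they put
    (sql, params, reply_queue) tuples on `requests` and read results
    from their reply queue. A None request shuts the worker down.
    """
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS links (url TEXT UNIQUE)')
    while True:
        item = requests.get()
        if item is None:  # shutdown sentinel
            break
        sql, params, reply = item
        cursor = conn.execute(sql, params)
        conn.commit()
        reply.put(cursor.fetchall())
    conn.close()
```

<p>Spider and downloader threads can then share one worker without ever sharing a connection, which is exactly the restriction the update note mentions.</p>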
<p>OK, I said I wasn't going to, but I did end up writing a bit of code, although it didn't get too far out of hand. Yet :). It solves *all* of my problems: it does not download files over 30MB in size, and it never downloads the same link twice.</p>
<p>I found <a href="http://mail.python.org/pipermail/python-list/2007-August/454005.html">this message</a> on the python mailing list, which seemed like a very good start. It almost did what I needed, but not quite, and also the parsing was overcomplicated and didn't catch all links, so I replaced that with a simple regular expression.</p>
<p>I ended up changing most of the code and functionality (for instance, it now stores links in a database). There's a lot of hard coding in there, which I could factor out if people want to use it, but for now it solves my problems beautifully ;).</p>
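<p>The "store links in a database, never download the same link twice" part can be sketched with sqlite along these lines (a simplified illustration; the table and function names are my own, not necessarily spider.py's):</p>

```python
import sqlite3

def open_linkdb(path=':memory:'):
    # One table of seen URLs; UNIQUE guards against duplicates
    # even if two code paths record the same link.
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS links (url TEXT UNIQUE)')
    return conn

def is_new_link(conn, url):
    # True the first time a URL is seen; records it as a side effect,
    # so callers only ever download links this returns True for.
    seen = conn.execute('SELECT 1 FROM links WHERE url = ?', (url,)).fetchone()
    if seen:
        return False
    conn.execute('INSERT INTO links (url) VALUES (?)', (url,))
    conn.commit()
    return True
```

<p>With the database on disk rather than `:memory:`, the "seen" set survives between runs, which is what makes deleted files stay deleted.</p>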
<p>It's used with the following syntax:</p>
<div class="code">
<code><pre>
# initial set up
python spider.py createdb
# add a new blog to be harvested
python spider.py add http://url.of.blog/
# (shallowly) spider all blogs for new links to files
python spider.py
# spider a url to a specific depth (5 for example should get
# most everything, but will take a while)
python spider.py deepspider 5
# download all files
python spider.py download
</pre></code></div>
<p>A minor problem is that curl doesn't do *minimum* file sizes, and with a lot of broken links it downloads small files that aren't really ogg or mp3 files, but HTTP error responses. I can probably solve this better, but for now I call the download from an update script as follows:</p>
<div class="code">
<code><pre>
python spider.py download
find . -iname "*.mp3" -size "-100k" -print0 | xargs -0 rm
find . -iname "*.ogg" -size "-100k" -print0 | xargs -0 rm
find . -iname "*.mp3" -print0 | xargs -0 mp3gain -k -r -f
find . -iname "*.ogg" -print0 | xargs -0 vorbisgain -fr
</pre></code></div>
<p>Translation: download files, throw away suspiciously small ones, mp3/vorbisgain what's left.</p>
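<p>A size threshold is a blunt instrument; another option would be checking the first bytes of each download, since real mp3s start with an ID3 tag or an MPEG frame sync and oggs with 'OggS', while a stray HTML error page starts with neither. A rough sketch of that heuristic (my own, not part of the update script above):</p>

```python
def looks_like_audio(first_bytes):
    # Very rough magic-number check: an ID3v2 tag or a bare MPEG
    # frame sync for mp3, the 'OggS' capture pattern for ogg.
    # An HTML error page saved under a .mp3 name matches none.
    if first_bytes.startswith(b'ID3'):
        return True
    if first_bytes.startswith(b'OggS'):
        return True
    if (len(first_bytes) >= 2 and first_bytes[0] == 0xFF
            and first_bytes[1] & 0xE0 == 0xE0):
        return True  # MPEG audio frame header (11 sync bits set)
    return False
```

<p>Reading just the first few bytes of each downloaded file through this would catch mislabeled error pages regardless of their size.</p>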
<p>Here's the code:</p>
<p>Edit 2008-04-18: Moved the code to Google Code, so I don't have to update it here. Find the latest version here: <a href="http://code.google.com/p/thisfred-python-stuff/source/browse/trunk/mp3spider/spider.py">spider.py</a></p>
<h3>The Musical Gardener's Tools #4: Lazyweb, lazyweb on the wall...</h3>
<p><em>Thu, 24 Jan 2008</em></p>
<p>...who is the smartestest wgetter of them all?</p>
<p>I need a little help here. As I've described as part of <a href="http://thisfred.blogspot.com/2007/03/musical-gardeners-tools-2-more-sources.html">an earlier post</a>, one of my sources for new music is wget, in combination with an ever growing list of mp3 blog urls. The ever growing part is now slowly starting to become a problem. I ran my update script yesterday evening and it took well over 12 hours to complete. (Mind you, I have fiber optics to the door; speed is not an issue, at least not at my end.) That is unacceptable, in terms of energy wasted. Also, the way it works potentially wastes a lot of bandwidth for the poor blog owners, mostly because files I have deleted are downloaded again, unless they were removed from the blog in the meantime. Note that this hits sites harder when they put up music I don't like or already have, but that should hardly be the measure of all things. Maybe. ;)</p>
<p>I see two ways to solve this:</p>
<ol>
<li>drastically clean up the list of urls that I harvest from.<br /><br />
This is possible, and I do it semi-regularly, but new and interesting mp3 blogs keep popping up, so this is only a short-term solution.<br/><br/>
</li>
<li>filter out the stuff I know I don't want.<br/><br/>
To some extent, I know what I don't want to download. First of all, long podcasts and extended mixes (let's arbitrarily say anything over 20MB), since the way I like to listen to music is at the individual track level; otherwise all my tagging tools and last.fm don't work. Anyway, we're getting past the whole idea that (web) music radio is consumed in an order predefined by someone else. More suggestion, less force feeding, kthxbye. (On a tangent: can we get this for news radio: just the news items, not a whole, usually extremely repetitive, bulletin as the atomic unit? True podcasting should let me skip items I'm not interested in/have already heard.) Second of all, for obvious reasons, all the files I've already downloaded but deleted.
</li>
</ol>
<p>Since I am far from a linux command line deity, I thought I would ask here: does anyone have suggestions on how to start tackling these two problems, given the script:</p>
<div class="code">
<code><pre>
wget --timeout=5 -U"Mozilla/5.0" -r -l1 -H -t1 -x -nc -np -P ~/mp3blogs/ -A.mp3,.ogg -erobots=off -i ~/mp3blogs/urls.txt
</pre></code>
</div>
<p>A: How can I limit the size of mp3s and oggs downloaded in this way to, for instance, 20MB per file? Keep in mind, throwing them away after downloading is not an option, since I want to prevent the download from happening at all. I don't think wget has a switch for this, so it will probably not be possible in a one-liner.</p>
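<p>One workaround, assuming wget really has no per-file maximum, would be a HEAD request per link, skipping anything whose Content-Length header is too big before any download starts. The decision itself is just header parsing; here is a sketch of that part (the function name is mine, and the HTTP call is deliberately left out):</p>

```python
def too_big(headers, max_bytes=20 * 1024 * 1024):
    # Decide from HTTP response headers whether to skip a file.
    # A missing or malformed Content-Length counts as "don't skip":
    # we can't tell the size, and would rather download than lose a track.
    value = headers.get('Content-Length') or headers.get('content-length')
    try:
        return int(value) > max_bytes
    except (TypeError, ValueError):
        return False
```

<p>Filtering the link list through this, then handing the survivors to wget, would prevent the oversized downloads rather than merely deleting them afterwards.</p>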
<p>B: I would like to store all of the urls of the files I do download (probably just in a flat text file for now) and then have my script skip them when downloading. Again, I don't think a one-liner is possible.</p>
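<p>For B, a flat file of downloaded urls plus a filter step before wget runs would do the job; something like this sketch (the file layout and function name are hypothetical, one url per line):</p>

```python
def filter_new(urls, seen_path):
    # Read previously downloaded URLs, return only the unseen ones,
    # and append those to the file so the next run skips them too.
    try:
        with open(seen_path) as f:
            seen = set(line.strip() for line in f)
    except IOError:
        seen = set()  # first run: nothing downloaded yet
    new = [url for url in urls if url not in seen]
    with open(seen_path, 'a') as f:
        for url in new:
            f.write(url + '\n')
    return new
```

<p>Because deleted files stay listed in the flat file, they are never fetched again, which is the whole point of problem B.</p>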
<p>Solutions to either problem are worth a $20 Amazon voucher from me (or somewhere else, I don't really care, as long as I'm out only $40 total and it's not too much hassle to get it to you.)</p>
<p>I am, of course, the sole judge of this contest, but I will try to be fair. You don't have to give me a whole script; I'm a fairly competent programmer, just not too deep into bash, so if you'll point me at where to start, and I get it to work, that counts as a solution. Although, as I've said, it's going to grow beyond a one-liner, I would like to keep it a simple script; I'm not looking for an application. I could build one in Python myself, but I want to keep it zero maintenance, basically too simple to even put the code into subversion.</p>
<p>UPDATE 2008-01-28: I'm now looking into pavuk, which may or may not have all the features I need. If this works, I just earned myself $40 :)</p>
<p>UPDATE 2008-01-28.1: pavuk, although having rather exotic naming of options and switches, seems to solve A quite nicely, which is a bandwidth (and time, and thus energy) saver. Finding all the right options was made much easier by <a href="http://pavuk.cvs.sourceforge.net/*checkout*/pavuk/pavuk/wget-pavuk.HOWTO?revision=1.2000">this guide</a>. I'm still thinking about solving B, there may be options in pavuk to help me with that too.</p>
<p>For completeness' sake, the updated script looks like this (wrapped here with a shell line continuation):</p>
<div class="code">
<code><pre>
pavuk -timeout 5000 -identity "Mozilla/5.0" -lmax 1 -retry 1 -dont_leave_dir -cdir ~/mp3blogs/ -asfx .mp3,.ogg -noRobots \
  -urls_file ~/mp3blogs/urls.txt -maxsize 30000000 -fnrules F '*' '%h/%d/%n'
</pre></code>
</div>
<h3>The Musical Gardener's Tools #3: The Kitchen Sync</h3>
<p><em>Wed, 20 Jun 2007</em></p>
One of the potential downsides of obtaining your music from a large number of mixed quality sources is that your collection will be overrun by crap if you don't aggressively cull it. Since I listen to music on at least 4 machines (my laptop, my work desktop, my home desktop and my iAudio M5 hard drive player) synchronisation could become nightmarish: if I delete something from my laptop and I sync with any of the other machines, I don't want the deleted crap to reappear, but I do want new stuff I downloaded to get transferred.
The way I solved this is with a few scripts using the wonderful <a href="http://en.wikipedia.org/wiki/Rsync">rsync</a> and a bit of self-discipline:
<h4>syncing between computers</h4>
I have two scripts on my work desktop called hello.sh and goodbye.sh. The former I run every morning when I come into the office; it synchronizes all music from my laptop onto my desktop, including new, changed or deleted files:
<div class="code">
<code><pre>
#! /bin/sh
rsync -avz --delete laptop:~/ogg/ ~/ogg
~/ogg/mp3blogs/update
./rm_empty
rsync -avz --delete ~/ogg/ laptop:~/ogg
</pre>
</code>
</div>
where 'laptop' is the hostname of the laptop, and 'update' and 'rm_empty' are the names of the scripts mentioned in <a href="http://thisfred.blogspot.com/2007/05/new-lastfm-features-rock.html">a previous post</a>. So, the script does the following, in order:
<ol>
<li>synchronize files from laptop to desktop</li>
<li>download new files from selected mp3blogs to the desktop</li>
<li>remove any empty directories under the ogg directory on the desktop</li>
<li>synchronize files from desktop to laptop</li>
</ol>
That last step is actually redundant when I don't forget to use the accompanying 'goodbye.sh' script when I leave at night, but sometimes I do, when I have to run for a train.
The 'goodbye.sh' script is even simpler:
<div class="code">
<code>
<pre>
#! /bin/sh
./rm_empty
rsync -avz --delete ~/ogg/ laptop:~/ogg
</pre>
</code>
</div>
and does the following:
<ol>
<li>remove any empty directories under the ogg directory on the desktop</li>
<li>synchronize files from desktop to laptop</li>
</ol>
<h4>syncing between a computer and a music player</h4>
For this I wrote a little Python script, mostly because I like Python syntax much better than whatever shell script syntax (yeah, I'm new school), but it could be easily solved differently. The use case here is: all the music on any one of my computers will never fit on the puny 20GB my music player sports. That's ok, because this is only meant to hold the music I *know* I like, and to which I like to relax on the train to and from home. So the problem is we want to synchronize a subset of the music on (for instance) my desktop. I made a script that does this:
<div class="code">
<code>
<pre>
#!/usr/bin/env python
from os.path import isdir
from os import listdir, system

local = '/home/eric/ogg'
iaudio = '/media/IAUDIO/MUSIC'
localdirs = listdir(local)
iaudiodirs = listdir(iaudio)
for entry in iaudiodirs:
    iaudio_path = iaudio + '/' + entry
    local_path = local + '/' + entry
    if isdir(iaudio_path):
        if entry not in localdirs:
            print "synching %s from iaudio to local" % iaudio_path
            system('rsync --size-only --delete --delete-excluded '
                   '--exclude-from=/home/eric/.rsync/exclude -avz '
                   '--no-group "%s/" "%s"' % (iaudio_path, local_path))
        else:
            print "synching %s from local to iaudio" % entry
            system('rsync --size-only --delete --delete-excluded '
                   '--exclude-from=/home/eric/.rsync/exclude -avz '
                   '--no-group "%s/" "%s"' % (local_path, iaudio_path))
</pre>
</code>
</div>
With small modifications, this can be made to work with any music player that behaves like an external HD under Linux (obviously paths and directory names need to be changed, I did not try to make this script generic).
What it does is run through all the artist directories on my player.
If an artist directory exists there that is not on my desktop, it copies it, under the assumption that it is a new artist that I like and picked up somewhere or other.
If the artist directory *does* exist, it does the exact reverse: it syncs from the computer *to* the player, under the assumption that I only delete or add single files on the desktop, since it's too much of a hassle to do it on the music player directly.
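<p>Those two branches boil down to one tiny decision; isolated from the rsync calls it looks like this (a hypothetical `sync_direction` helper I made up to spell out the logic, not part of the original script):</p>

```python
def sync_direction(entry, localdirs):
    """Decide which way to copy an artist directory found on the player.

    New on the player (not on the desktop): assume it's a new artist
    picked up somewhere, copy player -> desktop. Present on both: the
    desktop is authoritative, copy desktop -> player.
    """
    if entry not in localdirs:
        return 'player_to_desktop'
    return 'desktop_to_player'
```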
If either of these assumptions is not valid in your case, obviously the script wouldn't work for you without some serious modification.
<h3>The Musical Gardener's Tools #2: More Sources</h3>
<p><em>Wed, 07 Mar 2007</em></p>
<p>My second biggest source for new music is the web, where, with a little work, a lot of high quality free and legal stuff is to be had. Here are some of my tips:</p>
<h4><a href="http://www.last.fm">www.last.fm</a></h4>
<p>Easily my favorite website/service of the last years. For anyone still unfamiliar with it, what it does, in a nutshell, is keep track of all the music you listen to on your computer (or even on your portable music player) and generate <a href="http://www.last.fm/user/thisfred">weekly and lifelong personal and global charts</a> from that.</p>
<p>While people with chart fetishes may feel that's quite exciting already, where last.fm positively shines is in what it does with those charts: after a few hundred songs, it starts to compute your musical neighbours, and recommends artists you may or may not have heard of. It lets you listen to a personal 'recommended radio' station, which is in my opinion last.fm's greatest feature. It will play the artists last.fm thinks you might like based on your neighbours, in addition to personal recommendations from other users, and recommendations sent to groups you belong to.</p>
<p>What I usually do is have 'recommendation fridays', where instead of starting my regular music player, I listen to recommendation radio all day. If stuff comes by that I really like, I check whether it's available as a download on last.fm (there are <a href="http://www.last.fm/charts/free/">loads of free downloads</a>), or see if it's available elsewhere.</p>
<p>See the sidebar on the right for my weekly artist chart, and a link to 'thisfred radio' which you can listen to from any flash enabled browser, or from last.fm's own standalone music player.</p>
<h4><a href="http://www.daytrotter.com">www.daytrotter.com</a></h4>
<p>The Daytrotter Sessions are a great and consistently high quality source of unique mp3s. The idea is that bands touring the area stop by at Daytrotter and exclusively record three or four songs, which are then put up as free mp3s on the site. The bands are usually on the indie side of the fence, and on the verge of breaking through, although there are some bigger names in the list.</p>
<p>Consistently making a new, interesting session available at least every week, for a good while now, is a pretty amazing achievement. The new edition is a welcome surprise in my bag o' RSS each week.</p>
<p>A few of my personal favorites:</p>
<ul>
<li><a href="http://www.daytrotter.com/daytrotterSessions/68/free-songs-casiotone">Casiotone for the Painfully Alone</a></li>
<li><a href="http://www.daytrotter.com/daytrotterSessions/577/free-songs-about">About</a>. This one just in, and maybe a bit chauvinistic, since they're from the Netherlands. I gather they'll be playing <a href="http://2007.sxsw.com/">South by Southwest</a> (see below) in Austin this month, so do check them out if you're there (if you are, I'm green with envy) and in the mood for some high energy melodic bleepcore laptop pop.</li>
</ul>
<h4><a href="http://player.sxsw.com/torrents/">South By Southwest Showcase Torrents</a></h4>
<p>I've never been to SxSW, but every year it looks like I'm missing a lot, and I definitely plan to save up and go there one year. That year won't be 2007, unfortunately. *Fortunately*, for us Atlantically challenged Erpians, SxSW makes available a torrent of mp3s from artists that will be playing the festival each year. Apparently, not all of the music industry is clueless. The torrents go back to 2005, and are pretty large. It's some 8GB of music, a *lot* of it very good.</p>
<h4><a href="http://amiestreet.com/">amiestreet.com</a></h4>
<p>Just discovered this today: Amie Street is an mp3 web store with several twists. First of all: DRM-free, which is a sine qua non for me, but not terribly earth-shattering. What is interesting is their business model: all mp3s start out as free (as in beer) downloads, but rise in price as they get more recommendations. People recommending the mp3s that get popular get a little kickback, if I understand correctly, which they can use to buy other mp3s. So it literally pays to check out new and unknown stuff, and the less adventurous/miserly users have a pretty good indication of popularity in the price of individual mp3s.</p>
<p>After sifting through some of the free mp3s, I must say the quality is varied to say the least, but that's to be expected. What I think I'll do is shell out some money, and jump in after the first round of sifting through is done, and look for the gems in the 1-10¢ price range. Watch <a href="http://amiestreet.com/users/thisfred">this space</a> for my recommendations.</p>
<p>I do think this might work as a business model, where you let users with little money pay with their time, and vice versa. It does feel right. And they don't just have completely unknown bands on there either. I already saw Barenaked Ladies and Au Revoir Simone advertised.</p>
<h4>Your favorite mp3blogs and <a href="http://www.gnu.org/software/wget/">wget</a></h4>
<p>A slightly more geeky way to get your mp3s, which I originally found <a href="http://www.veen.com/jeff/archives/000573.html">here</a> and then slightly adapted to suit my particular needs better.</p>
<p>As noted by Jeffrey in his post, using wget for this in the wrong way can cause bandwidth problems for the sites you are hitting, so use caution: presumably you are targeting those sites because you like the music they make available, causing them problems is probably not the best way to ensure they continue to do so.</p>
<p>The way I call wget is:</p>
<div class="code">
<code><pre>
wget -U"Mozilla/5.0" -r -l1 -H -t1 -x -nc -np -P ~/mp3blogs/ -A.mp3,.ogg -erobots=off -i ~/mp3blogs/urls.txt
</pre></code>
</div>
<p>(That should be all on one line.)</p>
<p>My wget call differs from Jeffrey's in the following ways:</p>
<ul>
<li>I added -nc, which stands for 'no clobber': it means wget won't re-download files that are already there, which I'm sure makes the site owners happier. I think the original does a checksum check on the files, so it won't reload them *unless* they have changed. Since I use mp3gain on the files, and almost always correct some tags, that means they would always be downloaded again in my case, losing the changes I made...</li>
<li>I removed -nd and added -x, which forces directories for the entire url path, because I like having the directories over a single directory with all the files: it shows me where the files came from, so I can give kudos for those I like, and if I end up getting a lot of crappy ones from a particular site, I can remove its url from urls.txt. This can mean a lot of empty directories after a while, but I have a script for that too, see below.</li>
<li>I added .ogg to the file mask, just on the off chance that someone out there is providing oggs rather than mp3s.</li>
</ul>
<p>Some more nice wget tips can be found <a href="http://applications.linux.com/article.pl?sid=07/01/08/2219231">here on linux.com</a>.</p>
<p>After the update, I run the following bash script to remove any empty directory trees that are created by using wget in this way:</p>
<div class="code">
<code><pre>
#!/bin/bash
LS="$(find ~/mp3blogs -type d -empty)"
echo "$LS"
while [ -n "$LS" ]; do
find ~/mp3blogs -type d -empty -print0 | xargs -0 rm -rf
LS="$(find ~/mp3blogs -type d -empty)"
done
</pre>
</code>
</div>
<p><strong>[Edit 2007-08-23:]</strong> One thing that script doesn't take into account is album covers: my <a href="http://www.sacredchao.net/quodlibet">excellent music player</a> lets me directly delete songs from the hard drive if I decide I don't like them, but when jpegs or playlist files remain in a directory when all the songs have gone, it won't ever get cleaned up. So I wrote a new version, that also takes an argument for the path:</p>
<div class="code">
<code><pre>
#!/bin/bash
set -u
find "$1" -depth -type d | while read dir
do
songList=`find "$dir" \( -iname '*.ogg' -o -iname '*.mp3' \)`
if [[ -z "$songList" ]]
then
rm -rf "$dir"
fi
done
</pre>
</code>
</div>
<p>Then all that remains is to run the recursive mp3gain and vorbisgain commands I described in my previous post.</p>
<p>Of course I call these three commands (and then some I will talk about in an upcoming post) from a single master script, called 'hello', which I run about once a day while I get my morning coffee.</p>