On Tue, Apr 24, 2012 at 9:35 AM, Jared winick <[EMAIL PROTECTED]> wrote:
> I gave an Introduction to Apache Accumulo presentation last month at
> the Boulder/Denver Meetup where I demoed an application that used Accumulo
> to provide real-time and historical access to words/phrases seen in Twitter
> messages as well as daily trend analysis. I finally got the demo polished up
> a bit and running on Amazon EC2 where it can be found
> at http://trendulo.com.
>
> Trendulo is still pretty Alpha at this point so please feel free to add to
> the existing documented issues at
> https://github.com/jaredwinick/trendulo where you can also obviously find
> the source.
>
> As an example, the following link will show the launch of Instagram's
> Android client, followed by Facebook's purchase and then a small increase in
> general "chatter" about the product http://goo.gl/XcCG8
>
> Let me know if anyone has any questions or comments. Feel free to tweet
> @trendulo any interesting searches and I can retweet them out.
>
> Jared

Searching for the word "school" is neat; you can clearly see the weekends.

The domain name is cool.

Keith

> I gave an Introduction to Apache Accumulo<http://www.slideshare.net/jaredwinick/introduction-to-apache-accumulo> presentation
> last month at the Boulder/Denver Meetup<http://www.meetup.com/Boulder-Denver-Big-Data/events/55277392/> where
> I demoed an application that used Accumulo to provide real-time and
> historical access to words/phrases seen in Twitter messages as well as
> daily trend analysis. I finally got the demo polished up a bit and running
> on Amazon EC2 where it can be found at http://trendulo.com.
>
> Trendulo is still pretty Alpha at this point so please feel free to add to
> the existing documented issues at https://github.com/jaredwinick/trendulo where
> you can also obviously find the source.
>
> As an example, the following link will show the launch of Instagram's
> Android client, followed by Facebook's purchase and then a small increase
> in general "chatter" about the product http://goo.gl/XcCG8
>
> Let me know if anyone has any questions or comments. Feel free to tweet
> @trendulo any interesting searches and I can retweet them out.
>
> Jared

Thanks for the kind words, I appreciate it. Keith, my ingest process was down on Mar 19-20, so that is why I am missing data for that period.

For those who are curious, I am receiving about 1.2 million tweets a day and have about 3 billion entries in my main table. I am actually getting by with everything running on an EC2 medium instance, which is obviously very far from ideal but I am trying to stay on a budget.

I hope to add new features as time allows, things like near real-time trending and geospatial analytics. If anyone has any ideas for features they think would be interesting, just let me know or add them as issues on the github page.

On Tue, Apr 24, 2012 at 11:40 AM, Billie J Rinaldi <[EMAIL PROTECTED]> wrote:
> That's so cool that I'm creating a new section for it on our page of links:
> http://accumulo.apache.org/papers.html
>
> Billie

> How many key-values does a single tweet become, on average? What's the storage size per tweet?

So a single tweet turns into many key-values for each n-gram/time period. I would have to verify but on average I think it works out to about 1 tweet to 60 key-values. I end up seeing from a few hundred entries/sec inserted in the middle of the night to about 2000 entries/sec during peak evening times.
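
As a rough illustration of why one tweet fans out into many key-values, here is a minimal Python sketch. The whitespace tokenizer, the maximum n-gram length of 3, the hour/day bucket formats and the `(row, column)` layout are all assumptions made for illustration; none of them is taken from the Trendulo source.

```python
from datetime import datetime, timezone

def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram of length 1..max_n as a single string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def counter_keys(text, when, max_n=3):
    """Return the (row, column) counter keys that one tweet would increment.

    Hypothetical layout: row = n-gram, column = time bucket
    (one hourly bucket and one daily bucket per n-gram).
    """
    tokens = text.lower().split()
    buckets = [when.strftime("HOUR:%Y%m%d%H"), when.strftime("DAY:%Y%m%d")]
    return [(gram, bucket) for gram in ngrams(tokens, max_n) for bucket in buckets]

keys = counter_keys("school is out for summer",
                    datetime(2012, 4, 24, 15, 35, tzinfo=timezone.utc))
print(len(keys))
```

Under these assumptions a five-word tweet produces 12 n-grams and 2 buckets each, so 24 keys; longer tweets or more bucket sizes push the count up quickly.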

I am not exactly sure how to answer the question about storage size per tweet as I am not actually storing the original tweet and if a counter already exists for an n-gram/time period, then incrementing that counter doesn't increase the storage size. I can follow up with the current storage I am using though.
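
The point about counters can be sketched with a plain dictionary standing in for the table as it looks once counts have been summed. This is an in-memory analogy only, not Accumulo client code, and the key shape `(n-gram, time bucket)` is a hypothetical layout chosen for the example.

```python
from collections import Counter

def apply_increments(table, keys):
    """Add 1 to the counter for each key, creating the entry if it is missing."""
    for key in keys:
        table[key] += 1
    return table

table = Counter()

# Two tweets mention "school" on the same day: one entry, count 2.
apply_increments(table, [("school", "DAY:20120424"), ("school", "DAY:20120424")])
entries_after_two = len(table)

# A third mention changes only the value; a new n-gram adds a new entry.
apply_increments(table, [("school", "DAY:20120424"), ("summer", "DAY:20120424")])
print(entries_after_two, len(table), table[("school", "DAY:20120424")])
```

The number of stored entries tracks the number of distinct n-gram/time-period pairs, not the number of tweets, which is why a per-tweet storage figure is only an average.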

Aaron, I am using EBS now and I haven't seen any problems, that said my load is obviously not extreme. When I initially moved things from my home workstation to EC2, I had a few months of tweets to ingest. For that initial ingest I did run with local instance storage as I saw extremely variable performance when I first tried EBS. The instance storage was better, though not as good as what I see on bare metal.

> Speaking of storage - are you using EBS or local instance storage?
>
> On Apr 25, 2012, at 8:52 AM, Eric Newton wrote:
>
>> How many key-values does a single tweet become, on average? What's the
>> storage size per tweet?

> Aaron, I am using EBS now and I haven't seen any problems, that said my load is obviously not extreme. When I initially moved things from my home workstation to EC2, I had a few months of tweets to ingest. For that initial ingest I did run with local instance storage as I saw extremely variable performance when I first tried EBS. The instance storage was better, though not as good as what I see on bare metal.

Thanks for the info. I get the sense that you can scale up a single server more easily using EBS since you can attach like 10 volumes and RAID them up together. More vols might mean less variability too depending on how you configure RAID.

> I am not exactly sure how to answer the question about storage size per
> tweet as I am not actually storing the original tweet and if a counter
> already exists for an n-gram/time period, then incrementing that counter
> doesn't increase the storage size. I can follow up with the current storage
> I am using though.

I see I can make some estimates based on the information in your talk. The slides are awesome, btw.

Here is an up-to-date estimate. I naively reported disk usage as the "Disk Used" field under the Accumulo Master section of the monitor. Currently it appears I am only actually using ~26 GB of storage for my Accumulo tables. This is based on the "% Used" * "Unreplicated Capacity" fields in the NameNode section of the monitor, which is also corroborated by looking at the file system usage for the HDFS data directories. I have no other data in HDFS.

Dec 24 - Apr 30 = 128 days
3.0 billion entries / 128 days = 23.4 million entries/day
23.4 million entries/day / 1.2 million tweets/day ~ 20 entries/tweet (not sure if I misrepresented the number of tweets per day as 3 million before, but it is about 1.2)

26GB / ( 128 * 1.2e6 ) ~ 182 bytes/tweet
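
For anyone who wants to rerun the arithmetic, here is a small Python sketch of the same estimate. It assumes the Dec 24 start date falls in 2011 and that "26GB" means 26 * 1024**3 bytes; both are my reading of the numbers, not something stated above.

```python
from datetime import date

# Assumption: the collection window runs from Dec 24, 2011 to Apr 30, 2012.
days = (date(2012, 4, 30) - date(2011, 12, 24)).days

entries = 3.0e9                 # entries in the main table
tweets_per_day = 1.2e6          # tweets received per day
storage_bytes = 26 * 1024**3    # assumption: "26GB" read as binary gigabytes

entries_per_day = entries / days
entries_per_tweet = entries_per_day / tweets_per_day
bytes_per_tweet = storage_bytes / (days * tweets_per_day)
avg_entries_per_sec = entries_per_day / 86400  # seconds in a day

print(days, round(entries_per_tweet, 1), round(bytes_per_tweet),
      round(avg_entries_per_sec))
```

With these inputs the window is 128 days, the ratio is about 19.5 entries per tweet and the storage is about 182 bytes per tweet, in line with the figures above. Reading 26 GB as 26e9 bytes instead would give roughly 169 bytes per tweet.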

I am using the VARLEN encoding for the SummingCombiner which probably helps save a lot of space as I would imagine there are a lot of entries with a very small count as the language used on Twitter is far from normal.
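
To show why a variable-length encoding helps when most counts are small, here is a sketch of a generic LEB128-style varint in Python. This is not Accumulo's actual VARLEN byte format, which is not reproduced here; it only illustrates the general idea of spending fewer bytes on small values than a fixed 8-byte long would.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 payload bits per byte (LEB128-style)."""
    if n < 0:
        raise ValueError("only non-negative counts are supported in this sketch")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)

def decode_varint(data):
    """Inverse of encode_varint: rebuild the integer from its 7-bit groups."""
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
    return result

FIXED_WIDTH = 8  # bytes used by a fixed-length 64-bit counter
for count in (1, 3, 127, 128, 1_000_000):
    print(count, len(encode_varint(count)), FIXED_WIDTH)
```

In this scheme any count up to 127 fits in a single byte, so a table dominated by n-grams seen only a handful of times spends far less on values than it would with fixed-width counters.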

> On Wed, Apr 25, 2012 at 3:10 PM, Jared winick <[EMAIL PROTECTED]> wrote:
>
>> I am not exactly sure how to answer the question about storage size per
>> tweet as I am not actually storing the original tweet and if a counter
>> already exists for an n-gram/time period, then incrementing that counter
>> doesn't increase the storage size. I can follow up with the current storage
>> I am using though.
>
> I see I can make some estimates based on the information in your talk. The
> slides are awesome, btw.
>
> Using the information you provided: Dec 24 - March 12... that's 88 days.
> 2.6e9 entries, 3 million-ish tweets per day:
>
> 2.6e9 / (3e6 * 88)
>
> ~10 entries per tweet.
>
> Also, you report disk usage of 72G, which I will interpret as 72 * (1024
> ** 3) bytes.
>
> So, each tweet, on average occupies: 72G / (88 * 3e6) Or, ~300 bytes.
>
> -Eric
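
The quoted estimate can be reproduced with a short helper. This is only a sketch that plugs in the figures exactly as quoted (2.6e9 entries, 3e6 tweets/day, 88 days, 72G read as 72 * 1024**3 bytes); the function name and parameters are made up for the example.

```python
def per_tweet_estimates(entries, tweets_per_day, days, storage_gib):
    """Return (entries per tweet, bytes per tweet) for the given totals."""
    tweets = tweets_per_day * days
    return entries / tweets, storage_gib * 1024**3 / tweets

entries_per_tweet, bytes_per_tweet = per_tweet_estimates(2.6e9, 3e6, 88, 72)
print(round(entries_per_tweet, 1), round(bytes_per_tweet))
```

The results come out a little under 10 entries per tweet and a little under 300 bytes per tweet, consistent with the rounded values in the quote.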
