Archive for the ‘node-js’ Category

I spent this past weekend hunkered down in the basement of the local Elks club, working on a project for a hackathon. The project was a tweet-ranking web application. The idea was to build a web app that would allow users to log in with their Twitter account and view a modified version of their Twitter timeline that shows them tweets ranked by importance. Spending hours every day scrolling through your timeline to keep up with what’s happening in your Twitter network? No more, with Twizzard!

How can we score Tweets to show users their most important Tweets? Users are more likely to be interested in tweets from users they are more similar to and from users they interact with the most. We can calculate metrics to represent these relationships between users, adding an inverse time decay function to ensure that the content at the top of their timeline stays fresh.
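A minimal sketch of how such a score might be put together, assuming the similarity and interaction metrics are already computed and normalized to the range 0 to 1. The function names, the equal weights and the six-hour constant are my own illustrative assumptions, not anything taken from the Twizzard code.

```javascript
// Hypothetical scoring sketch: every name and constant here is an assumption.
const HALF_SCORE_AGE_HOURS = 6; // assumed: a tweet this old keeps half its score

// similarity and interaction are assumed to be precomputed values in [0, 1].
function scoreTweet({ similarity, interaction, ageHours }) {
  // Combine the two relationship metrics with (assumed) equal weights.
  const relationship = 0.5 * similarity + 0.5 * interaction;
  // Inverse time decay: 1 for a brand-new tweet, shrinking toward 0 with age.
  const decay = 1 / (1 + ageHours / HALF_SCORE_AGE_HOURS);
  return relationship * decay;
}

// Return a new array ordered from highest to lowest score.
function rankTimeline(tweets) {
  return [...tweets].sort((a, b) => scoreTweet(b) - scoreTweet(a));
}
```

Calling rankTimeline on an array of objects carrying similarity, interaction and ageHours fields returns them with the freshest, most closely related tweets first. An exponential decay would work just as well; the inverse form is used only because the post mentions it.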

That’s one measure of “importance.” Being able to assign a rank would be useful as well, say for the British Library.

Norch is a search engine written for Node.js. Norch uses the Node search-index module which is in turn written using the super fast levelDB library that Google open-sourced in 2011.

The aim of Norch is to make a simple, fast search server, that requires minimal configuration to set up. Norch sacrifices complex functionality for a limited robust feature set, that can be used to set up a freetext search engine for most enterprise scenarios.

Currently Norch features:

Full text search

Stopword removal

Faceting

Filtering

Relevance weighting (tf-idf)

Field weighting

Paging (offset and resultset length)

Norch can index any data that is marked up in the appropriate JSON format.
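For readers who have not met the relevance weighting named in the feature list, here is a minimal sketch of textbook tf-idf for a single query term. It is not Norch's or search-index's implementation, and the function names are made up for illustration.

```javascript
// Split text into lowercase alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Score each document against one (lowercase) query term with tf-idf.
function tfidfScores(docs, term) {
  const tokenized = docs.map(tokenize);
  const docsWithTerm = tokenized.filter((tokens) => tokens.includes(term)).length;
  if (docsWithTerm === 0) return docs.map(() => 0); // term appears nowhere
  // Inverse document frequency: terms found in fewer documents weigh more.
  const idf = Math.log(docs.length / docsWithTerm);
  return tokenized.map((tokens) => {
    if (tokens.length === 0) return 0;
    // Term frequency: share of this document's tokens that match the term.
    const tf = tokens.filter((t) => t === term).length / tokens.length;
    return tf * idf;
  });
}
```

A term that occurs in every document gets an idf of zero, which is the same intuition that motivates stopword removal.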

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided to do a proof of concept with new JavaScript tools. This runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse PDFs. A full-text index is also built, the beginning of a larger ingestion process.
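Because each PDF can be processed independently, the work can be spread over a fixed number of concurrent lanes. The sketch below shows that pattern only. The extractText function is a hypothetical stand-in for a PDF.js-based extractor, and none of this is taken from the proof of concept itself; spreading the work across CPU cores would additionally need multiple Node processes.

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight at once.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runLane() {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so lanes never collide
      results[index] = await worker(items[index], index);
    }
  }

  // Start up to `limit` lanes and wait for all of them to drain the queue.
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => runLane());
  await Promise.all(lanes);
  return results;
}

// Hypothetical placeholder: a real version would parse the file with PDF.js.
async function extractText(path) {
  return `text of ${path}`;
}
```

Calling mapWithConcurrency(paths, 4, extractText) resolves to the extracted texts in the same order as the input paths, with no more than four extractions running at any moment.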

My primary application has a ton of data, even in its infancy: hundreds of millions of distinct entities (and growing fast), each with many properties and many relationships. Numbers in the billions start to be really easy to hit, and that’s still not accounting for organic growth. Most of the data is hierarchical for now, but there’s a need in the near term for arbitrary relationships and the quick traversing thereof. Vanilla MySQL in particular is annoying to work with when it comes to hierarchical data. Moving to Oracle gets us some nicer toys to play with (CONNECT_BY_ROOT and such), but ultimately, the need for a complementary database solution emerges.
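To make the pain point concrete: with a plain parent-pointer (adjacency list) table and no hierarchical query support, finding the root of an entity means walking the chain yourself. The sketch below does in application code roughly what Oracle's CONNECT_BY_ROOT gives you inside a query. The row shape and the rootOf name are assumptions for illustration, not the author's schema.

```javascript
// Hypothetical rows from an adjacency-list table: each row points at its parent.
const rows = [
  { id: 1, parentId: null },
  { id: 2, parentId: 1 },
  { id: 3, parentId: 2 },
  { id: 4, parentId: null },
];

// Follow parent pointers upward until a row with no parent is reached.
function rootOf(rows, id) {
  const parentById = new Map(rows.map((r) => [r.id, r.parentId]));
  const seen = new Set();
  let current = id;
  while (parentById.get(current) != null) {
    if (seen.has(current)) throw new Error(`cycle detected at ${current}`);
    seen.add(current);
    current = parentById.get(current);
  }
  return current;
}
```

Done against a database, each step of that loop becomes a query or a self-join, which is part of why deep hierarchies push people toward hierarchical SQL extensions or graph databases.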

NOSQL bake-off

While my non-relational db experience is limited to MongoDB (which I love dearly), a graph db seemed to be the better theoretical fit. Requirements: Manage dense, interconnected data that has to be traversed fast, a query language that supports a root cause analysis use case, and some kind of H.A. plan of attack. Signals of Neo4j, OrientDB, and Titan started emerging from the noise. Randomly, I started in with Neo4j with the intent of repeating the test cases on the other contenders assuming any of the 3 met the requirements (in theory, at least). Neo4j has a GREAT “2 minutes to get up and running” experience. Untar, bin/neo4j start, and go to localhost:7474 and you’re off and running. A decent interface waits for you and you can dive right in.

Proof of concept code for testing Neo4j with project data.

The presumption of normalization in Neo4j continues to nag at me.

The broader the reach for data, the less likely normalization is going to be possible, or affordable if possible in some theoretical sense.

It may be that normalization is a presentation aspect of results. Will have to think about that over the holidays.

Seattle is lucky to have KINGFM, a local radio station dedicated to 100% classical music. As one of the few existent classical music fans in my twenties, I listen often enough. Over the past few years, I’ve noticed that when I tune to the station, I always seem to hear the plinky sound of a harpsichord.

Before I sent KINGFM an email, admonishing them for playing so much of an instrument I dislike, I wanted to investigate whether my ears were deceiving me. Perhaps my own distaste for the harpsichord increased its impact in my memory.

This article outlines the details of this investigation and especially the process of collecting the data.
….

Another data collecting/mining post.

If you were collecting this data, how would you reliably share it with others?

In that regard, you might want to consider distinguishing members of the Bach family as a practice run.

For some NLP research I’m currently doing, I was interested in parsing structured information from Wikipedia articles. I did not want to use a full-featured MediaWiki parser. WikiFetch crawls a Wikipedia article using Node.js and jQuery. It returns a structured JSON representation of the page.

Harvesting of content (unless you are authoring all of it) is a major part of any topic map project.

Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.

I was tempted to add ‘duct tape’ as a category. But there could only be one entry. 😉

Take an early weekend and have some fun with this tomorrow. August will be over sooner than you think.

Posted in Hadoop, MongoDB, node-js, Pig | Comments Off on Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js

One of the most difficult parts about caching is managing dependencies between cache entries. In order to reap the benefits of caching, you typically have to denormalize the data that’s stored in the cache. Since data from child items is then stored within parent items, it can be challenging to figure out what entries to invalidate in the cache in response to changes in data.
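One way to picture the bookkeeping involved is a cache that records, for every entry, which source records it was denormalized from, so that a change to one record can be mapped back to the entries to drop. This is my own minimal sketch of that idea, not the approach from the linked post; the class and key names are hypothetical.

```javascript
class DependencyCache {
  constructor() {
    this.entries = new Map();    // cache key -> cached value
    this.dependents = new Map(); // source key -> Set of cache keys built from it
  }

  // Store a value and record which source records it was built from.
  set(key, value, dependsOn = []) {
    this.entries.set(key, value);
    for (const source of dependsOn) {
      if (!this.dependents.has(source)) this.dependents.set(source, new Set());
      this.dependents.get(source).add(key);
    }
  }

  get(key) {
    return this.entries.get(key);
  }

  // Drop every entry built from the changed source; return the dropped keys.
  invalidate(source) {
    const keys = this.dependents.get(source) || new Set();
    for (const key of keys) this.entries.delete(key);
    this.dependents.delete(source);
    return [...keys];
  }
}
```

For example, after cache.set('post:1', html, ['user:7', 'comment:3']), a call to cache.invalidate('user:7') removes 'post:1'. The sketch is deliberately not transitive: entries built from other cache entries would need a second pass.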

As Nate says, a thought experiment but an interesting one.

From a topic map perspective, I don’t know that I would consider cache invalidation and naming things as two distinct problems. Or rather, the same problem under different constraints.

If you don’t think “cache invalidation” is related to naming, what sort of problem is it when a person’s name changes upon marriage? Isn’t a stored record “cached?” It may not be a cache in the sense of the cache in an online service or chip, but those are the special cases, aren’t they?

There are lots of other resources but that should be enough to get you started.

BTW, would “lite” servers with Node.js answer the question of who gets what data we saw in: Linked Data Semantic Issues (same for topic maps?)? Some people might get very little “extra” information if any at all. Others could get quite a bit extra. Would not have to build a monolithic server with every capability.

When we talk about performance what do we mean? There are many metrics that matter in different scenarios but it’s difficult to measure them all. Tom Hughes-Croucher looks at what performance is achievable with Node today, which metrics matter and how to pick the ones that most matter to you. Most importantly he looks at why metrics don’t matter as much as you think and the critical decision making involved in picking a programming language, a framework, or even just the way you write code.

BIOGRAPHY

Tom Hughes-Croucher is the Chief Evangelist at Joyent, sponsors of the Node.js project. Tom mostly spends his days helping companies build really exciting projects with Node and seeing just how far it will scale. Tom is also the author of the O’Reilly book “Up and running with Node.js”. Tom has worked for many well known organizations including Yahoo, NASA and Tesco.

I thought the discussion of metrics was going to be the best part. It is worth your time but I stayed around for the node.js demonstration and it was impressive!

The GT.M data model is a hierarchical associative memory (i.e., multi-dimensional array) that imposes no restrictions on the data types of the indexes and the content – the application logic can impose any schema, dictionary or data organization suited to its problem domain. GT.M’s compiler for the standard M (also known as MUMPS) scripting language implements full support for ACID (Atomic, Consistent, Isolated, Durable) transactions, using optimistic concurrency control and software transactional memory (STM) that resolves the common mismatch between databases and programming languages. Its unique ability to create and deploy logical multi-site configurations of applications provides unrivaled continuity of business in the face of not just unplanned events, but also planned events, including planned events that include changes to application logic and schema.
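For anyone who has not worked with M, the "hierarchical associative memory" idea can be approximated as a tree in which every node may hold a value and any number of subscripted children. The sketch below is an in-memory analogy only, with names I made up. It is not GT.M's API, it has no persistence or transactions, and it keeps children in insertion order rather than the sorted order M uses.

```javascript
class HierarchicalStore {
  constructor() {
    this.root = { value: undefined, children: new Map() };
  }

  // Set the value at a path of subscripts, creating nodes along the way.
  set(subscripts, value) {
    let node = this.root;
    for (const key of subscripts) {
      if (!node.children.has(key)) {
        node.children.set(key, { value: undefined, children: new Map() });
      }
      node = node.children.get(key);
    }
    node.value = value;
  }

  // Walk the path and return the node, or undefined if any step is missing.
  find(subscripts) {
    let node = this.root;
    for (const key of subscripts) {
      node = node.children.get(key);
      if (!node) return undefined;
    }
    return node;
  }

  get(subscripts) {
    const node = this.find(subscripts);
    return node ? node.value : undefined;
  }

  // List the subscripts directly below a path.
  childKeys(subscripts) {
    const node = this.find(subscripts);
    return node ? [...node.children.keys()] : [];
  }
}
```

Subscripts can be strings or numbers at any level, and a node can carry both a value and children, which is the property that lets the application impose whatever schema suits it.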