Over the past couple of months, I have been trying to navigate the machine learning job market. It has been a bewildering, confusing, and yet immensely satisfying and informative time. Talking with friends in similar situations, I find a lot of common threads, and I find surprisingly little clarity online regarding this.

So I’ve just decided to put together the sum total of my experiences. Your mileage may vary. After you’re done being a fresher, your situation and what you’re looking for gets a little more unique, so take whatever I say with a pinch of salt.

I’ve been passionate about machine learning for six years or more now. Though I didn’t realize it at that time, a lot of project choices, career choices and course choices I made were with the thought of ‘does this help me get closer to a research-oriented job that involves text mining in some form?’. I went to grad school at a university that was very research oriented and worked on a master’s thesis on an NLP problem, as well as a ton of projects in courses. My first job after that involved NLP in the finance industry. My second job also involved text processing. The jobs I got offers from after this period also involve NLP strongly. I’ve literally never worked on anything else. So you can understand where I’m coming from.

So. Machine learning jobs. Where are they, usually?

Literally everywhere, it turns out. Every company seems to have a research division that involves something to do with data, and data mining. The nature of these positions can vary.

There are positions where you need to have some knowledge of machine learning, and it kind of informs your job, which might or might not involve having to use ML-based solutions. Usually these positions are at large companies. As an example, you might be in a team whose output is, say, an email client. There’s some ML used in some features of the product, and it is important for you to be able to grasp and work around those algorithms, or be able to analyze data, but on a day to day basis you’re working on writing code that doesn’t involve any ML.

There are other similar positions where you deal with a higher volume of data, and they have simple solutions to get meaning out of them. Maybe they use Vowpal Wabbit on a Hadoop cluster on occasion. Or Mahout. But they’ve got the ML bit nailed down, and more of the work involves just doing big data kind of work. These positions are more ubiquitous. If you have some ML on your resume, as well as Hadoop or HBase, these doors open up to you. Most of the places that require this kind of a skillset are mid-sized companies kind of out of the startup phase.

Then you have the Data Scientist positions. This phrase is pretty catchall, and you find a wide variety of positions if you look for this title. Often at big firms, it means that you have knowledge of statistics, and can deal with tools like R, Excel, SQL databases, and maybe Python in order to find insights that help with business decisions. The volume of data you deal with isn’t usually large.

At startups though, this title means a lot more. You are usually interviewing to be the go-to person for all the ML needs in the company. The skills they interview for include all the ones I mentioned above, plus a thorough knowledge of tools like scikit-learn and Weka, as well as experience working on ML projects. Some big data experience is usually a plus. Often, you’re finding insights in the data and prototyping things that an engineering team will put in production. Or maybe you’re also doing that if ML is not central to the startup’s core business.

Most people are looking for the Research Engineer job. You aren’t usually coming up with new algorithms. But you’re implementing some. On the upper end of the scale, you’re going through research papers and implementing the algorithms in them and making them work. You need a fair idea of how to put code into production, and you deviate from research by adding layers to things to make your system work in a more deterministic, debuggable fashion. An example would be several jobs at LinkedIn where a lot of the features on the site need you to use collaborative filtering or classification. Increasingly, these jobs work on large data, but often that is not the case, and people manage fine using parallel processing instead of graph databases and MapReduce.

In a mature team, this position might not require you to use your ML skills on a day to day basis. In a new team, this position would need you to work on end to end systems that happen to use ML that you will be implementing.

In larger firms, you probably just need to have worked on ML in grad school, and your past jobs. It doesn’t matter the nature of the kind of data you’ve worked on. In startups though, they start looking for more specific skills. Like they’d want someone who’s specifically worked on topic modelling. Or machine translation. The complexity of their system doesn’t usually call for a PhD. They would grab an off the shelf solution if they could. But they would ideally want someone who has an idea of these things own this component and manage it completely, and be able to hit the ground running, which is why they want someone who’s worked on same or similar things previously.

Which brings me to another point. Not all ML jobs are interviewed for in the same way.

Several large as well as mid-sized tech firms hire you for the company, not for a specific team or role. Usually, the recruiter finds you based on buzzwords in your resume, and sets up interviews with you. The folks interviewing you probably work in teams that have nothing to do with your skills. It is possible you go through interviews not answering even one ML question. Later when you get hired, they try to match you to a team, and they try to take into account your ML background to place you in a relevant team. If you’re interviewing for a specific kind of job, this makes it harder, as you don’t know what kind of work you’ll be doing until you’re done with the whole process.

Like I said before, at startups probably, you’ll know exactly what kinds of problems you’ll be working on. But more often, you’re hired into a group of sister teams. They all require similar skills. Maybe they work on different components of the same product, all of which use ML in different ways. So you have a fair idea of what you’ll be working on, but not necessarily a clear picture. You might end up working at the heart of the ML algorithm, or maybe you’re preprocessing text. The interviews will go over your ML background and previous projects as well as ML-related problem-solving.

Then there’s the Applied Researcher role. You usually require a demonstrated capability of working on reasonably complex ML problems. You are occasionally putting things in production and need good coding skills. Often, you’re prototyping things after researching different approaches. When you do put things in production, they are usually tools for other teams that use ML in their solutions. Language is no bar, but usually there’s an agreed-upon suite of tools and languages that the team uses.

The Researcher role usually requires a PhD. Your team is probably the idea factory of the company, or that particular line of business of that company. Intellectual property generation is part of the job. I’m not highly insightful about this line of work, because I haven’t known very many people opting for these positions, and it feels increasingly like PhDs take up the Applied Researcher/Research Engineer role in a team, and do the prototyping and analyses while others help with that as well as put these prototypes into production.

There’s a lot of overlap in all these different types of positions I’ve mentioned, and it isn’t a watertight classification. It’s a rough guide to the different kinds of positions there are.

So where do you find these jobs?

LinkedIn is a great resource. You can use ‘machine learning’, ‘data mining’, ‘image processing’ or ‘data science’ or ‘text mining’ or ‘natural language processing’ as search keywords. I’ve also found Twitter to be a great place to search for jobs using these same keywords.

There are tons of job boards that also enable you to search using these keywords. Apart from them, I find a lot of ML-specific job fora. KDNuggets Jobs, NLPPeople and LinguistList are browsable job boards. There are also mailing lists like ML-News and SIG-IRList. I’ve also found /r/MachineLearning on Reddit to be a good resource on occasion for jobs.

Now that you’ve found a position and sent them off your resume and they got back to you, what do you expect in the interview? Wait for my next post to find out!

Someone who works for the Govt of India told me about the Indian Gazette, which publishes a summary of all the activities of the government in English and Hindi. And there are state gazettes as well, which I assumed did the same. I found that the central government puts out the gazette with the same content in both English and Hindi. As perfect a sentence-alignment as you can expect.

Unfortunately, it doesn’t seem like the Karnataka government does that. They publish everything in only Kannada. The Kerala government publishes in only English. And the Tamil Nadu government publishes some bullet points in English and some in Tamil.

I’d not checked on this earlier, unfortunately. Now I’m back to square one, looking for a dataset for Kannada machine translation. Know of any?


For starters, I started using Tess4J which is the Java wrapper around Tesseract. Getting started is reasonably easy. I just followed the instructions here.

And then the problem was, I was constantly getting dependency issues with the DLLs. I used Dependency Walker to diagnose what dependencies weren’t being satisfied. Turned out, msvcp110.dll and msvcr110.dll weren’t installed on my system. I installed them from here.

Then I downloaded the Kannada training data from the Parichit project listed on the Tesseract plugins page. And then I found a larger file and thought maybe that would be better. Apparently not. It resulted in a bunch of errors. There’s something wrong with this training file. It’s detailed here.


In my last post, I talked about the need for machine translation in Indian languages, and how I was looking for use-cases. I think I’ve found a viable use case and a viable market.

Now that that’s done, I’m looking to do Kannada OCR, followed by language translation. And I’ll document whatever I read, whatever I find, on this blog, for accountability, visibility and discussion.

I start with Kannada OCR. OCR is pretty much the first step to translation, when you are dealing with scanned documents. I found there’s lots and lots of software that deals with this. It occurred to me that it’s not a hard problem at all.

A little more googling gave me Tesseract. It seems to pretty much be the gold standard for OCR. I noticed that ABBYY Finereader doesn’t have Kannada as one of its options… I must admit its API is pretty topnotch. Tesseract is a C++ library. The good thing is, there’s a whole bunch of other language wrappers around it. I can’t seem to find a Python 3 wrapper around Tesseract that also works on Windows, so I suppose I’ll get started on it using Java.

I found a few nice papers on Kannada OCR too. This one is a good introductory, though old, paper. This one is about segmenting old Kannada documents. As someone who doesn’t have much knowledge of what OCR will entail, especially segmentation, I found these useful in my context. I assume there are better, more descriptive papers on OCR as such, and I should read some more comprehensive survey papers on the subject.

These two papers provide more information on Tesseract as such, and while trying to get it working in Java, I also ought to read them in order to get a more intuitive understanding of the system I’m working with.


For a while now, I’ve been pondering the problem of machine translation for Indian languages.

Given India has 22 scheduled languages, many of them pretty damn closely related, and given there are so many enthusiastic people in the NLP domain, we should be at the forefront of machine translation. Unfortunately that is not the case. Yet.

It also bothered me that the current leader in a working machine translation system is Google. Google, while having some of the best scientists and engineers, is American in soul and legality. There are several reasons to have homegrown machine translation systems that are made in India, and which have a more Indian focus.

In any case, I haven’t worked personally on machine translation systems, though I have worked with colleagues who have, and it gives me a vague understanding of how it works. From what I’ve seen, Google is great in the generic case. But if you have a very specific focus, say, in the financial domain, or you want the translations to be conversational, or if you want to restrict yourself to the legal domain, you would need to improve and tweak what translations Google throws at you.

I’ve also seen that most of the machine translation work in Indian languages has been very academic. This is welcome, but in practice, these things don’t usually make it to the market. In my experience, approaching a problem like this from an academic perspective is very different from approaching it as an engineer. In academia, I have largely seen the approaches be technique based. The problem is just a setting to explore new techniques of solving it. This works brilliantly in uncovering new approaches to solving a problem. When I was at UCI, this defined my approaches. To find a newer, more improved technique to do something. As an engineer however, you want to find and implement a ‘good enough’ solution. You want whatever works. You don’t care if you need to have humans in the loop, or if your training data isn’t perfect. I haven’t seen (m)any Indian language translation systems with this approach of using whatever works, getting humans in the loop, and giving imperfect outputs.

I want to try this.

There are so many cool questions I want answered. Like, how easy will it be to translate between closely-related languages like Tamil-Malayalam, Hindi-Punjabi, Assamese-Oriya-Bengali, Kannada-Telugu… and so on? How well will Jason Baldridge’s two-hour-tagging-required POS-tagger work on Indian languages? What happens if I use Sanskrit as interlingua?

I also found that the largest corpus of cross-language translation for Indian languages is the Gazette of India. It is a Govt of India communication, that is posted in English as well as other languages. I think Google uses this for its statistical machine translation heavily. Unsurprisingly, there is a very formal tilt to the translations. This is way more pronounced in Indian languages where formal style is very, very different from casual conversation. Detecting the formality level of an English sentence and translating it appropriately into an Indian language seems like an interesting problem.

Use cases for translation in India are also something I wonder very hard about.

The obvious use case is a generic translation app. This is not something I’m inclined to go head-to-head with Google on. Not right away at least 😉 But it ought to be something we keep coming back to.

The next obvious use case is an API stack of some sort that others can use to build their apps and meet other regionalization needs. Google Translate API seems to be a clear winner here as well. It will take a while to build something with that level of reliability and generic nature. But not too long, I’d wager.

A good start however would be a niche need. Like maybe translating legal documents from one language to another. Or to English (but then English is an Indian language too 😉 ). I can use Google’s API to generate training data cheaply, and then tweak my built model around for my specific usecase.

Another niche need would be to translate from one Indian language to another in an app that tourists/visitors can use to navigate around town. The kicker here is, how much more useful would your app be as compared to a phrasebook? A more useful app in this context would be one that can read signboards and translate them for you.

Yet another is to help the diaspora and other Indians learn a language through simple translated sentences. Again, this falls into the trap of how much better this would be if it was done like a phrasebook or the app version of “Learn Kannada in 30 days” manuals.

Another idea is to make the Government of India your customer, and help them with their regionalization needs. But then, the government has more bilingual people in the IAS itself than they need, and simple translation is probably not at all an issue when you’re operating at the scale of the government of India.

The dark side of me is thinking up an exciting novel/movie, though. Two idealistic US-educated scientists get inspired by Make In India and go back to make a simple translation app. After a whole lot of failures in monetizing their work, they are suddenly approached by the Govt of India, by the same officials who laughed their idea off earlier. Picture a Paresh Rawal at his droll politician best telling these meek urban types how much their idea will never work in the ‘real India’, and right after the interval coming back with a more serious professional look and demeanor along with the head of R&AW. Now I want this guy to be played by someone who radiates quiet power. Maybe Atul Kulkarni, but he’s got to look a decade older, and a bit better built. And they find they can instantly become rich if they sell their code to NTRO, to use on the NETRA program (kind of like PRISM). They say thanks but no thanks, but heck, the head of NTRO tells them it’s an offer they can’t refuse. This guy’s got to be a persuasive shades-of-grey sort of wizened spy who used to work at AT&T and NASA before he got recruited and had to fake his death and everything and now works under a new name. The two protagonists have a ‘Gasp! It’s him!’ moment of recognition because they’ve actually used a lot of his research to make their software. This NTRO guy can easily be played by Madhavan. And the rest of the plot is about how they decide on what to do with their software, whether they join NTRO, and whether they can sleep at night knowing they are being used to spy on billions of little online conversations every day.


I found today that Amazon S3 has a really cool one-click backup, where you can set things to back up regularly to Amazon Glacier.
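
For the curious, here is roughly what that setting looks like if you do it from code instead of the console. This is a minimal sketch, assuming the feature is an S3 lifecycle rule that transitions objects to the Glacier storage class, and that you use boto3 to talk to AWS. The bucket name, the 30-day delay and the function names are all made up for illustration.

```python
def glacier_lifecycle(days=30, prefix=""):
    """Build an S3 lifecycle configuration that moves objects to Glacier.

    days and prefix are illustrative defaults, not recommendations.
    """
    return {
        "Rules": [
            {
                "ID": f"archive-to-glacier-after-{days}-days",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_lifecycle(bucket_name, config):
    """Send the configuration to S3. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    # Only builds and prints the rule; call apply_lifecycle("my-bucket", ...)
    # with a real bucket name (hypothetical here) to actually set it.
    print(glacier_lifecycle(days=30))
```

Keeping the rule as a plain dictionary means you can look it over before anything gets sent to AWS.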

And Amazon DynamoDB also has this thing where you can set it to automatically back up to a table in another region.

You can also set DynamoDB to back up to S3.

Glacier is apparently like a substitute for magnetic tape, without the inconveniences of tape. Takes a while to restore, as well. Pretty cool idea. I wonder what competition exists in this space, currently. A cursory search suggests none.

Glad this is there. It’s pretty essential.


I had to go to the doctor recently. So the patient sits opposite to the doctor, maybe a little to the right, and the doctor’s in front of a computer, and is keying in things into the hospital management software. The doctor has her back to the door. What I noticed this time was, there’s a mirror at the back of the door. So the patient, from where she sits, can actually see the doctor’s computer monitor.

A bulb of recognition went off.

A long time ago, I’d attended a talk at UCI’s HCI seminar series. I think it was Dr. Yunan Chen’s practice talk for her presentation at CHI. Her research is mainly about device use in the medical field. This particular talk was about an ethnographic study of patients’ perceptions of doctors’ device use.

One thing that the patients had an issue with was doctors typing into a computer as the patients spoke. They wondered what the doctor was typing, whether the doctor actually was listening, and if the doctor was doing something like checking mail or Facebook. And that led to a lot of lack of confidence.

Looks like Swedish has taken into account that research. A mirror at the back of the door is a simple solution. You can be sure the doctor isn’t on Facebook, even if you can’t read what they’re typing through the mirror. And doctors also take time to show you what they’ve written and inform you they’ll be printing it out for you anyway, and that you can have online access to this information as well.

Pretty good, huh, to see something go from research to implementation 🙂


I don’t remember the last time I posted here. I don’t even think anyone remembers this place exists.

Irrespective.

I’ve grown a lot careerwise. This blog was supposed to help me along that journey, but somehow fell by the wayside. Also there’s this overarching guilt of not doing enough to post here. My big plans still remain. But every time someone asks me about them, I chuckle sadly.

So what’s been on with me? I graduated from UC Irvine under Dr. Ihler in 2011. After that, I was doing NLP for the finance industry for two years. It’s quite an interesting field, I must say. I had one class that covered insider trading and EBITDA and Mergers and Acquisitions and I found all of it enormously interesting. I didn’t unfortunately keep up with my financial knowledge though. I didn’t really need it in what I did on a day to day basis.

And what did I do? I worked on a whole bunch of interesting things. So you have a ginormous quantity of documents coming in in so many different forms, and you need to parse them all and extract data from them. So you end up doing all these extremely basic things. You use OCR to convert image PDFs to text. You parse PDF in all its ugliness and convert it to a simpler format, while taking care to preserve some of the PDF-ey things about PDFs. And then it turns out there are 90 languages and your clients speak English. So you translate 90 languages into English. Some of it’s easy, especially European languages. A lot of it is painful. But we aren’t looking for high-quality translations… just enough for the numbers in the financial documents to make sense. But then you run into a lot of unique problems. You don’t want to translate Yuan to Dollars. You find that most off-the-shelf translators are built for general language, not finance-specific language, so all the translations are different.

And then you do other interesting stuff with all the stuff you’ve processed so far. You try Named Entity Recognition. You try recommending similar documents. You try identifying series in document streams. You try creating summaries.

All of it was mighty interesting. On a given day, I’d code in C, C#, Perl, Java and Python and it’d all be no big deal. I learnt what MVC and MVVM meant. I began taking a real interest in software design. I learnt how to write maintainable large code. And the benefits of version control.

And then it was time to move.

I work now for a large online retailer’s Search&Discovery division. And that’s all I can say about it. Maybe some day I’ll reminisce fondly on what text mining challenges I face here, the scale of what I work on, and other things that would have by then become old hat. But not now.

I’ve had other interesting experiences with data in the meantime. Facebook NY came up with a Data Science round table. The invitee list looked like Chief Scientist, Head, Data Science, Asst Prof…. you get the drift…. and then me, with less than two years of work experience. It was insanely interesting to meet such people and have them treat me like they had a lot to learn from me. I learnt so much that day that though I’ve forgotten all their names, the discussions are still etched in my mind. It isn’t everyday that you have MCMC sampling explained to you over beer and fries someone else is paying for.

And then I tried a hack I’m not allowed to talk about, and I learnt there’s a feature in POS Taggers called the Gazetteer, where all you do is give it a set of phrases and the POS they belong to, and bam, any occurrence of those phrases (exact matches) is tagged thus. It’s insanely useful when you have your own new part of speech, like say, Celeb Names or Book Titles or some such.
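
To make the idea concrete, here is a minimal sketch of a gazetteer lookup in plain Python. It is not how any particular POS tagger implements the feature. The function name, the tag names and the example phrases are all hypothetical, and I assume the text is already split into tokens.

```python
def gazetteer_tag(tokens, gazetteer):
    """Tag exact phrase matches from a gazetteer, preferring longer phrases.

    tokens: list of words.
    gazetteer: dict mapping a phrase (tuple of words) to its tag.
    Returns (token, tag) pairs; tokens outside any phrase get the tag None.
    """
    max_len = max((len(phrase) for phrase in gazetteer), default=0)
    tagged = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in gazetteer:
                match = (n, gazetteer[phrase])
                break
        if match:
            n, tag = match
            tagged.extend((tok, tag) for tok in tokens[i:i + n])
            i += n
        else:
            tagged.append((tokens[i], None))
            i += 1
    return tagged


# Hypothetical custom "parts of speech".
gazetteer = {
    ("War", "and", "Peace"): "BOOK_TITLE",
    ("Leo", "Tolstoy"): "CELEB_NAME",
}
sentence = "Leo Tolstoy wrote War and Peace".split()
for token, tag in gazetteer_tag(sentence, gazetteer):
    print(token, tag)
```

In a real tagger, the tokens left as None would fall through to the statistical model. The gazetteer only takes over for the phrases you listed.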

So that’s what I’ve been up to. Let’s hope I keep up this pace of blogging.

Since then, it’s been a long break from all things hardcore machine learning and data mining and natural language processing. I have a nice day job which pays for my essentials and still leaves me with enough time and money to do a lot of other stuff. My team does a lot of ML, but that does not include what I’m working on at the moment. It might involve me writing some code which learns stuff from data and predicts on some other data, but I don’t know yet.

It’s been a good break. I needed this. I’m a much more confident person now. I have more confidence in my abilities to write and maintain large bits of code. I think it’s about time for me to get back to learning all about machine learning and graphical models with no stress of deadlines and enough opportunity to explore, and most importantly, no feeling intimidated. Also, going through material the second time over would be a good way of absorbing all that I missed the first time.

I’ve been cleaning out my hard disk in order to make conditions ideal for me to do this. A messy filesystem is really hard to work with. Especially with no version control or anything. Things get messy and when it’s crunch time, it only gets worse, not being able to find what you want because you haven’t labelled anything right.

I cleared out all my backups off of my external hard drive. Then I moved my entire pre-NYC-move photographs to the external HD. Going over which individual images to keep and which ones to delete was very cringeworthy – I had been quite camera-happy before 2009, and had clicked a lot of pictures. They say your first 10,000 pictures are your worst. Believe me, mine were. So overtly cringeworthy. More so since back then I didn’t even pay attention to how I dressed or how I did my hair or how I maintained my skin. Now those issues don’t exist anymore, so the cringing isn’t coupled with embarrassment and helplessness in my head like before.

I then uninstalled a lot of unnecessary software. Multiple builds of Python, with crazy sets of plugins on each build. Outdated versions of Eclipse. And oh, so many datasets. Deleted what I could, shifted the rest to my external HD. Tried organizing all my music, tagging them appropriately and attempting to put them into the right folders. Wasn’t so easy, so gave up midway. But I discovered that Mp3Tag seems to be a good app to do this.

I then organized my huge collection of ebooks using Calibre. I seem to have a lot of crap I downloaded from Project Gutenberg back in my young-and-foolish days in the infancy of the Google-powered Internet. Somehow, I just can’t delete classic books, no matter how I’ve never read them. So they stay for now.

Turns out, I have tons of movies stored as well, which I’d downloaded off of Putlocker back when I couldn’t even afford Netflix. Organized them well. I also seem to have a small collection of stuff downloaded off Youtube – clever and rare Indian ads, rare music videos of indie Indian pop/rock/movies. I need to upload them back to Youtube someday, for the originals I downloaded from seem pretty much deleted off the face of Youtube.

I even found all the original Stanford Machine Learning Class videos with Prof. Ng. Heh, with Coursera and Udacity, and Khan Academy now, you don’t need any of those like I did back in 2008-09. It was a different time back then, really.

I installed Python 3.2 after that. And Eclipse Juno. Followed by PyDev and the Google App Engine plugin for Eclipse. A Windows installer for SciPy exists which is compatible with Python 3.2. However, Matplotlib’s official Windows installer releases don’t yet support Python 3.2. Thankfully, unofficial ones exist here (oh yay, look, it’s from UCI). I can of course build everything from source, but I want to keep this as hassle-free as possible.

I also need to get started with version control on Google Code or some such, so that I keep all my code somewhere I can access from everywhere.

Now next on the agenda is to go through a machine learning textbook, or an online course and slowly build my own libraries for machine learning from scratch. Maybe I’ll try building a Weka replica – uniform interface for training and testing each algorithm.
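
Here is a rough sketch of what I mean by a uniform interface, in plain Python. It is only an illustration under my own assumptions, not a design I have settled on. The class names, the fit/predict method names and the toy data are all made up, and the two algorithms are the simplest ones I could think of.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Classifier(ABC):
    """Every algorithm exposes the same two methods."""

    @abstractmethod
    def fit(self, X, y):
        """Learn from feature vectors X and labels y; return self."""

    @abstractmethod
    def predict(self, X):
        """Return one predicted label per feature vector in X."""


class MajorityClass(Classifier):
    """Baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OneNearestNeighbor(Classifier):
    """Predicts the label of the closest training point (squared Euclidean)."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            self.y_[min(range(len(self.X_)), key=lambda i: sq_dist(x, self.X_[i]))]
            for x in X
        ]


def accuracy(model, X_train, y_train, X_test, y_test):
    """Train and test any Classifier the same way."""
    predictions = model.fit(X_train, y_train).predict(X_test)
    return sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)


# Toy data, invented for the example.
X_train = [(0, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y_train = ["a", "a", "b", "b", "b"]
X_test = [(0, 0.5), (5.5, 5)]
y_test = ["a", "b"]

for model in (MajorityClass(), OneNearestNeighbor()):
    print(type(model).__name__, accuracy(model, X_train, y_train, X_test, y_test))
```

The payoff is the loop at the end: once every algorithm has the same two methods, adding a new one does not change the training and testing code at all.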

After that is to work on probabilistic graphical models and build those from scratch as well.

And in the midst of all this, I want to publish the work I’ve done in my thesis, which will mean trying to replicate those results, in a new and improved way, taking into consideration all the ideas I didn’t have time for, and those which I could have implemented better.

Let’s see how it goes 🙂 I hope to keep updating this place with all the stuff I do 🙂


I use Twitter quite a bit. A lot of the people I follow share quite a lot of links. When I browse Twitter on my mobile in the morning, I can’t check out all the links. I usually ‘Favorite’ the links that seem interesting and then browse them later. I’d actually prefer a better interface to this, which enables me to tag these links privately so that I can look for them later as well.

I found one such webapp whose name I now forget. The problem with it was it had a sucky interface and didn’t let me preview all the links properly. Then there’s also Tweetree which offers previews of shared links. I also like the Google Reader/Gmail sort of interface which keeps track of new links and already read links. And also, when multiple people share the same link, I’d like to see it all collapsed as one with “X, Y and Z shared this” next to it. Or something.

So this is one thing I’d like to build using Google App Engine.

The steps to do so would be as follows:

1. Find a nice Twitter API interface for Python, preferably one that can be integrated with Google App Engine.

2. Write code to get tweets from your Twitter timeline.
2(a) Learn how to use Twitter OAuth.

3. Detect tweets with links. When a tweet has one, extract the unshortened link.

4. By now, you have a set of links, and can choose to display them as you wish.

5. Use the App Engine datastore to store previously viewed links. Possible attributes to be stored along with each link include the users who shared the link, the timestamps of the tweets which shared it, viewed-or-not (when dropping into the database after extraction, this attribute should have the value ‘No’), and the title of the linked page. Also store the time of last login.

6. Workflow: On login, extract links from the timeline and drop them into the database until the timestamp of the tweet you’re reading is earlier than the time of last login. Then display the links with a ‘viewed-or-not’ value of ‘No’ as ‘Unread items’ and the rest as ‘Read’ items. On clicking each link, mark it as read. Also provide checkboxes to mass-markAsRead.
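
Before touching the Twitter API or App Engine, the steps above can be sketched with plain Python data structures. This is only a toy under my own assumptions: tweets are dictionaries with hypothetical `user`, `text` and `timestamp` fields, an in-memory dictionary stands in for the datastore, and link unshortening and OAuth are left out entirely. All the names are made up.

```python
import re
from dataclasses import dataclass, field

# Naive pattern: grabs everything up to the next whitespace, so trailing
# punctuation stuck to a link would be captured too.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class LinkRecord:
    url: str
    shared_by: list = field(default_factory=list)   # users who shared it
    timestamps: list = field(default_factory=list)  # when it was shared
    viewed: bool = False                             # starts as 'No'


class LinkStore:
    """In-memory stand-in for the datastore."""

    def __init__(self):
        self.links = {}
        self.last_login = 0

    def ingest(self, timeline, now):
        """Read a newest-first timeline until tweets predate the last login."""
        for tweet in timeline:
            if tweet["timestamp"] < self.last_login:
                break
            for url in URL_PATTERN.findall(tweet["text"]):
                # The same link shared by several people collapses into one record.
                record = self.links.setdefault(url, LinkRecord(url))
                if tweet["user"] not in record.shared_by:
                    record.shared_by.append(tweet["user"])
                record.timestamps.append(tweet["timestamp"])
        self.last_login = now

    def unread(self):
        return [r for r in self.links.values() if not r.viewed]

    def mark_read(self, *urls):
        """Mark one or many links as read (the mass-markAsRead case)."""
        for url in urls:
            self.links[url].viewed = True


# Invented timeline, newest first.
timeline = [
    {"user": "Z", "text": "lunch was great", "timestamp": 130},
    {"user": "Y", "text": "Worth reading http://example.com/a", "timestamp": 120},
    {"user": "X", "text": "http://example.com/a and http://example.com/b", "timestamp": 110},
    {"user": "W", "text": "old news http://example.com/old", "timestamp": 90},
]

store = LinkStore()
store.last_login = 100
store.ingest(timeline, now=200)
for record in store.unread():
    print(record.url, "shared by", ", ".join(record.shared_by))
```

The `shared_by` list is what would drive the “X, Y and Z shared this” display. Swapping the dictionary for the App Engine datastore should mostly change `LinkStore`, not the workflow.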