Tech Roots » Development (Ancestry.com Tech Roots Blog)

Scaling Node.js in the Enterprise
Robert Schultz | March 31, 2015

Last year we began an effort internally at Ancestry to determine if we could scale out Node.js within the frontend applications teams. Node.js is a platform that we felt could solve a lot of our needs as a business to build modern, scalable, distributed applications using one of our favorite languages: JavaScript. I want to outline the steps we took, the challenges we faced, and what we have learned after six months of officially scaling out a Node.js ecosystem within an enterprise organization.

Guild

We introduced the concept of a guild in Q4 of 2014 to bring together everyone who is working with, or interested in, Node.js. The guild concept comes from Spotify's agile engineering model: a group of people who are passionate about a particular subject. In our case, we wanted to get everyone together to identify the steps needed to get Node.js adopted within the organization. We meet once a month to discuss topics related to Node.js, which promotes a high level of transparency across the company; anyone is welcome to join and recommend topics. Once we established the guild, it was a great starting point for getting passionate people in the same room.

Training

Before we began to invest in Node.js as a platform, we wanted to ensure a consistent level of knowledge across our engineering group on building Node.js applications. We organized two training sessions for small groups of engineers, in both our Provo, UT and San Francisco offices, led by the awesome folks over at nearForm. Each session had about 15 engineers. We kept the groups small but influential enough that the individuals who received the training could start building applications and, in turn, spread their knowledge. This worked well: we had teams immediately starting to think about components that could be built in Node.js.

Interoperability

As you accumulate multiple technologies in your ecosystem, you need to ensure they are all interoperable. This means you need to decouple some of your systems, ensure you're communicating over a common protocol everyone understands, such as HTTP, and use a common transport format such as JSON. We have a lot of backend services in our infrastructure that were built with C#, so in order to support multiple technologies we needed to work with the dependent service teams to ensure we have pure REST services exposed.

We also distribute service clients via NuGet, which is the standard package management system for C#, but this is not going to work for any other language. You will need to ensure that you are building extremely thin clients with well-documented API specifications. We want to treat our internal clients like we would any external consumer of an API. This allows any platform to build on top of our backend services and allows us to prepare and scale for the future on any emerging technology.

Monolithic Architecture

One of the biggest anti-patterns for Node.js applications is the monolithic architecture. This is an older pattern in which we built large applications that handle multiple responsibilities: hosting the client-side assets such as HTML, CSS and JavaScript; hosting an application API with many endpoints and responsibilities; managing the cache for all of the services; rendering the output of each page; and so forth. This type of architecture has several problems and risks.

First, it's extremely fragile under continuous deployment. Rolling out one feature can potentially break the whole application, disrupting all of your customers.

It's also extremely difficult to refactor or rewrite an application down the road if it's all built as one large application; three or four separate components are easier to rebuild or throw away than one large one.

Last, everything should be a service. Everything. A large web application that combines different responsibilities goes against this; they should be separate.

As you begin to break down your monolithic applications, one recommendation is to use a good reverse proxy to route external traffic to the new, separate applications while still maintaining the integrity of your URIs and endpoints.
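The routing idea can be sketched as a simple prefix table, the same mapping you would express in Nginx or another reverse proxy (all hostnames below are made up):

```javascript
// Sketch of the proxy's routing table: public URIs stay stable while
// requests are routed to the new, smaller applications. Hostnames are made up.
const routes = [
  { prefix: '/search',  upstream: 'http://search-app.internal:3000' },
  { prefix: '/profile', upstream: 'http://profile-app.internal:3001' },
  { prefix: '/',        upstream: 'http://legacy-monolith.internal:8080' },
];

// More specific prefixes are listed first; '/' catches the rest, so the
// monolith keeps serving anything not yet carved out into its own app.
function resolveUpstream(path) {
  return routes.find((route) => path.startsWith(route.prefix)).upstream;
}
```

As each responsibility is extracted, you add a row to the table; customers never see the URI change.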

Documentation

You need to document everything. We created an internal guide on anything and everything related to building Node.js applications at Ancestry: architecture, best practices, use cases, web frameworks, supported versions, testing and deployment. Anyone within our engineering team who is interested in adopting Node.js can use this guide as a first step to get up and running. It ensures that we have an open and transparent model of how to set up, configure, build, test and deploy applications. It is an evolving document that we review together often as a group.

Define Governance

Since Node.js is evolving so quickly, it is wise to establish a small governance group to manage it within your organization. This group should be responsible for defining standards, adopting new frameworks, optimizing the architecture and so forth. Again, keep it transparent and open to provide a successful ecosystem. For example, this group decides which web application framework we use, such as Express or Hapi.

Scaffolding

It's extremely important to help engineers get started on a new platform. With technology stacks like Microsoft ASP.NET or Java Spring MVC, the conventions are much more defined. In the Node.js world there are many different ways to do one thing, so we want to make this process more standardized and simple. We also want to ensure all engineers include common functionality in their applications without having to add it in themselves one by one.

So we have built generators using a tool called Yeoman. It allows you to define templates, or generators as they are called, to easily scaffold out new Node.js applications. This ensures consistency with the Node.js architecture: all common components and middleware are included, an initial set of unit tests with mocks and stubs is added, build tools are configured (such as Grunt or Gulp), and even your local hosting environment is scripted out with Vagrant and Docker configuration.

Internal Modules

As your engineering teams begin to scale out their efforts in Node.js, you will need cross-cutting functionality. One of the principles of Node.js is that it's great at doing small things well; this is a core Unix philosophy, and in the case of Node.js it should also apply to your common functionality. The package management system for Node.js is npm. When you build applications you're essentially building a composite application from open source modules in the community, all of which are hosted today on npmjs.org. But at larger companies with security policies in place, you do not want to publish your common functionality to the public, so you will need a way to host your modules internally.

Initially we went with Sinopia. It's an open source npm registry server that allows you to publish your modules internally. It also acts as a proxy: if a module isn't hosted internally, it will fetch it from npmjs.org and cache it. This is great for hosting all of your common code as well as improving performance, since your build system doesn't have to fetch each package every time.

Over time, as more teams began to publish packages, we needed something that would scale better. We introduced Artifactory, which provides a lot more functionality and also hosts many other package management systems such as NuGet, Maven, Docker, etc. This allows us to define granular rules around package whitelists, blacklists, aggregation of multiple package sources and more.
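Whichever registry you choose, clients opt in by pointing npm at it; a one-line .npmrc does the job (the hostname here is hypothetical):

```ini
; .npmrc checked into each project (or set per machine): point npm at the
; internal registry, which proxies and caches npmjs.org.
; The hostname is hypothetical.
registry=http://npm.internal.example.com/
```

With this in place, `npm install` resolves internal packages from the private registry and transparently falls through to the public one for everything else.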

Ownership

Building common shared functionality across teams can be difficult to maintain. Our approach is closer to an open source model. Each team can build the common functionality it needs to implement a Node.js application, but must follow a few rules so that features, bug fixes and enhancements can flow into modules. First, each module has to define a clear readme.md in its Git repository. Second, each module always has an active maintainer, listed right at the top of the readme.md, who is the go-to person for questions and pull requests. This allows for a flexible ownership model and transparency around these common bits of functionality. You absolutely must agree on your process as an organization for this to work.

Security

When you adopt any new platform, you need to make security a top concern. We've done this by using the helmet module, which gives you a lot of protection against common web attacks such as XSS, covering much of the OWASP Top 10. It's easy enough for anyone to use and comes as Express middleware. We are also investing in authentication at our middleware layer.
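For a sense of what such middleware does under the hood, here is a hand-rolled sketch of a few of the response headers helmet-style middleware sets; the values shown are typical defaults, not our exact configuration:

```javascript
// Hand-rolled sketch of a few security response headers that helmet-style
// middleware manages for you. Values are typical defaults, not our config.
function securityHeaders() {
  return {
    'X-Content-Type-Options': 'nosniff',             // disable MIME sniffing
    'X-Frame-Options': 'DENY',                       // block clickjacking via frames
    'X-XSS-Protection': '1; mode=block',             // legacy browser XSS filter
    'Strict-Transport-Security': 'max-age=31536000', // require HTTPS for a year
  };
}
```

With Express, the helmet middleware packages these protections (and more) behind a single `app.use(...)` call, so individual teams don't have to remember each header.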

You also want to make sure that the modules you're using are trusted. Since the Node.js ecosystem is built from free and open source modules, there is a risk that an engineer will use one without validating whether it's trusted or secure. We want to use only modules from sources we know or that have earned a high level of confidence on npmjs.org. This is also where our internal npm registry comes in, so that we can effectively blacklist npm modules that do not fit our criteria.

Last, ensure the modules you use comply with your licensing requirements. Using an MIT-licensed module is generally fine, but as an enterprise you may have stricter requirements around other licenses. I recommend looking into off-the-shelf software for this, or initially investing in some open source tools; there are npm modules that can do this for you.

DevOps

In your DevOps organization, you will most likely need to make adjustments to support Node.js deployments. A Node.js application deployment works differently than other applications, but it's actually quite simple. We use Chef to provision our deployments, so we needed to adjust our Chef recipes to add support for Node.js.

We needed to provision our servers to install Node.js, Supervisor and Nginx. We use this setup to get the highest throughput in a production environment.

Supervisor manages the Node.js process to ensure that if it dies it is automatically restarted. It also manages the number of Node.js instances that run on the server, letting us take advantage of multiple cores to scale both vertically and horizontally.
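A Supervisor program entry for this kind of setup might look like the following sketch; paths, process counts and ports are all illustrative:

```ini
; Sketch of a Supervisor entry that keeps four Node.js processes alive,
; one per core, each on its own port. Paths, counts and ports are illustrative.
[program:node-app]
command=node /srv/node-app/server.js
process_name=%(program_name)s_%(process_num)02d
numprocs=4
autorestart=true
environment=PORT=30%(process_num)02d
```

`numprocs` starts one process per core and `autorestart` brings any crashed process back, which is exactly the role described above.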

Nginx load-balances the incoming requests across the Node.js instances. Nginx is extremely efficient and scales web requests really well. We prefer tools that do a specific job and do it well.
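An Nginx sketch that spreads requests across the locally running Node.js instances (ports are illustrative and must match however the processes are started):

```nginx
# Sketch: Nginx load-balances incoming requests across the local
# Node.js instances. Ports are illustrative.
upstream node_app {
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;
    server 127.0.0.1:3002;
    server 127.0.0.1:3003;
}

server {
    listen 80;
    location / {
        proxy_pass http://node_app;
        proxy_set_header Host $host;
    }
}
```

Each tool keeps a single job: the process manager keeps instances alive, Nginx balances across them.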

If you have already used Node.js, you are aware of the cluster module. Our concern with using the cluster module to load-balance requests is that it is still experimental according to the Node.js stability index. We prefer to build a long-lasting model for deploying and managing Node.js instances in case the cluster module changes its API or gets deprecated one day.

Community

The Node.js community is really amazing, and we leverage it as much as we can. One way is by reaching out to others in the community to learn how they overcame challenges in their own adoption of Node.js. We've also brought in a few speakers to talk with our engineering group on the same topic and to build relationships in the community. For example, we've invited both Groupon and PayPal to talk with our group, which provided a lot of insight; you quickly recognize that everyone has different business models, but we're after a lot of the same technology goals: scalability, performance, security and so forth.

Envy

As we continued to make progress and started to ship Node.js applications to production, something interesting happened: other teams began to want to build new applications and prototype new ideas in Node.js, which has effectively created engineer envy. This is exactly how we want to roll out an emerging platform. If your engineering team feels there is a real problem being solved, and that the platform will help them be better at their jobs, they are much more inclined to adopt it. Happy engineers ultimately lead to amazing products and new ideas.

Future

So what are our next steps in scaling Node.js here at Ancestry?

We're continuing to invest in common, cross-cutting concerns; this is crucial so that as teams develop common dependencies, we build them once and in the right way. We're optimizing our architecture, ensuring everything is exposed as a service communicating over common protocols and transports. We continue to introduce ourselves to other industry leaders in the Node.js space and to be more visible, which is extremely helpful, and we want more presence at Node.js meetups; we are also working to host Node.js meetups in our San Francisco office soon.

This year we are also pushing to build our application service architecture around microservices. This includes optimizing our application delivery platform with containerization and Docker.

Conclusion

Overall, it's been an awesome learning experience for us, but we are just getting started. Node.js doesn't come as a free lunch; it takes work. Hopefully this post helps you adopt it efficiently and gives you some tips. Oh, and we're hiring!

Lesson Learned: Sharing Code With Git Submodule
Seng Lin Shee | February 26, 2015

You are probably aware of Git submodules. If you haven't heard of them, you may want to read about them from Atlassian and from Git itself. In summary, Git provides a way to embed a reference to a separate project within a main project, while treating both projects as separate entities (versioning, commits, etc.). This article applies to any project that makes use of such scenarios, irrespective of programming language.

Recently, my team had issues working with submodules. These ranged from changes in project structure to abrupt changes in the tools and commands required when working on projects that involve submodules. In the industry, some consider Git submodules an anti-pattern, arguing that the ideal solution is to reference shared code only as precompiled packages (e.g. NuGet, Nexus, etc.).

This post is a short reflection on how we can restrict ourselves to certain scenarios and how best to work with projects that use submodules daily.

When should you use submodules?

When you want to share code with multiple projects

When you want to work on the code in the submodule at the same time as you work on the code in the main project

In a different light, this blog highlights several scenarios where Git submodules will actually make your project management a living nightmare.

Why use submodules?

To solve the following problems:

When a new person starts working on a project, you'll want to avoid having this person track down the individual shared repositories that need to be cloned. The main repository should be self-contained.

In a CI (continuous integration) environment, plainly hooking up shared reference Git repositories as material pulls would be detrimental, as there is no coupling of versions between modules. Any modification in a repository would trigger the dependent CI pipelines, possibly blocking a pipeline on a breaking change.

Allow independent development of projects. For example, ConsumerProject1 and ConsumerProject2, which both depend on a SharedProject, can be worked on without worrying about breaking changes that would affect each other's pipeline status (which could block development and deployment of the separate projects/services).

How should submodules be restricted?

We found that the best way to prevent complexity from creeping into this methodology is to do the following:

Avoid nested submodule structures, that is, submodules containing other submodules, which may share the same submodules as others and thus create duplicates. The parent repository should NEVER itself be a shared project.

Depending on the development environment (e.g. Visual Studio), submodules should only be worked on when opened through the solution file of the parent repository. This ensures that consistent relative paths work across the other parent repositories that consume the same submodules.

Submodules should always be added to the root of the parent repository (for consistency).

The parent repository is responsible for satisfying the dependency requirements of submodules by linking the necessary submodules together, similar to the responsibility of an IoC (Inversion of Control) container (e.g. Unity).

What are the main Git commands when working with submodules?

When cloning a repository that contains submodules:

git clone --recursive <url>

When pulling in the latest changes for a repository (e.g. a parent project) that contains submodules:

git pull
git submodule update --init --recursive

The 'update' command updates the contents of the submodule folders after the first git pull updates the commit references of the submodules in the parent project (yes, it's a little odd).

When you want to update all submodules in a repository to their latest commits:

git submodule update --init --remote --recursive

By default, the above submodule update commands will result in your submodules being in a detached state. Before you begin work, create a branch to track all changes.
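The detached state and the branch fix can be seen end to end in a throwaway repository; a plain repo stands in for a submodule checkout here, since the mechanics are the same:

```shell
# Throwaway demo of the detached-HEAD state and the fix. A plain repo
# stands in for a submodule checkout; the mechanics are the same.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q repo && cd repo
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "init"
git checkout -q "$(git rev-parse HEAD)"   # checking out a commit detaches HEAD,
                                          # much like 'git submodule update'
git checkout -q -b fix/my-change          # create a branch so new work is tracked
git symbolic-ref --short HEAD             # prints: fix/my-change
```

Commits made while detached are easy to lose once you move HEAD; creating the branch first means everything you do is reachable by name.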

So, how do you use Git submodules? What best practices do you use to keep Git modules well-managed and within a certain amount of complexity? Do share your experience and feedback here in the comment section below.

Monitoring Progress of SOA HPC Jobs Programmatically
Chad Groneman | October 17, 2014

Here at Ancestry.com, we currently use Microsoft's High Performance Computing (HPC) cluster to do a variety of things. My team uses an HPC cluster for multiple purposes. Interestingly enough, we don't communicate with HPC in exactly the same way for any two job types. We're using the Service Oriented Architecture (SOA) model for two of our use cases, but even those communicate differently.

Recently, I was working on a problem where I wanted our program to know exactly how many tasks in a job had completed (not just the percentage of progress), similar to what can be seen in HPC Job Manager. The code for these HPC jobs uses the BrokerClient to send tasks. With the BrokerClient you can "fire and forget", which is what this solution does. I should note that the BrokerClient can retrieve results after the job is finished, but that wasn't my use case. I thought there would be a simple way to ask HPC how many tasks had completed. It turns out that this is not as easy as you might expect when using the SOA model, and I couldn't find any documentation on how to do it. I found a solution that worked for me, and I thought I'd share it.

HPC Session Request Breakdown, as shown in HPC Job Manager

With a BrokerClient, your link back to the HPC job comes from the Session object used to create the BrokerClient. From a Scheduler, you can get the ISchedulerJob that corresponds to the Session by matching the ISchedulerJob.Id to the Session.Id. My first thought was to use ISchedulerJob.GetTaskList() to retrieve the tasks and look at the task details. It turns out that for SOA jobs, tasks do not correspond to requests, and the tasks don't have any methods on them to indicate how many requests they've fulfilled, either.

I found my solution while looking at the results of the ISchedulerJob.GetCustomProperties() method. I was surprised to find it there, since the MSDN documentation states that these are "application-defined properties".

I found four name-value pairs which may be useful for knowing the state of tasks in a SOA job, with the following keys:

“HPC_Calculating”

“HPC_Calculated”

“HPC_Faulted”

“HPC_PurgedProcessed”

I should note that some of these properties don’t exist when the job is brand new, with no requests sent to it yet. Also, I was disappointed to find no key corresponding to the “incoming” requests, since some applications might not be able to calculate that themselves.

With that information, I was able to write code to monitor the SOA jobs.
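A sketch of that monitoring code is below. The type and method names are from my recollection of the HPC Pack scheduler SDK, and the property keys are the undocumented ones listed above, so verify both against your SDK version before relying on this:

```csharp
// Hedged sketch against the HPC Pack scheduler SDK: read the undocumented
// "HPC_*" custom properties to count completed SOA requests.
// Assumes 'session' is the Session used to create the BrokerClient.
using Microsoft.Hpc.Scheduler;
using Microsoft.Hpc.Scheduler.Properties;

IScheduler scheduler = new Scheduler();
scheduler.Connect("myheadnode");                   // hypothetical head node name
ISchedulerJob job = scheduler.OpenJob(session.Id); // the job behind the session

foreach (NameValue property in job.GetCustomProperties())
{
    if (property.Name == "HPC_Calculated")
    {
        Console.WriteLine("Completed requests: " + property.Value);
    }
}
```

Polling these properties on an interval is enough to drive a progress display, with the caveat (noted below) that Microsoft could change the keys at any time.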

With all that said, I should also mention that our other SOA HPC use case monitors the state of the tasks and is capable of more detailed real-time information. We do this by creating our own ChannelFactory and channels. With that approach, the requests are not "fire and forget": we get results back from each request individually as it completes, so we know how many outstanding requests there are and how many have completed. If we wanted to, we could use the same solution presented for the BrokerClient to find out how many are in the "calculating" state.

One last disclaimer: these "custom properties" are not documented, but they are publicly exposed, and Microsoft could change them. If they ever do, I hope they would consider it a breaking change and document it. There are no guarantees of that, so use discretion when considering this solution.

Big Data for Developers at Ancestry
Seng Lin Shee | September 25, 2014

Big Data has been all the rage. Business, marketing and project managers like it because they can plot out trends to make decisions. To us developers, Big Data is just a bunch of logs. In this blog post, I would like to point out that Big Data (or logs with context) can be leveraged by development teams to understand how our APIs are used.

Developers have implemented logging for a very long time. There are transaction logs, error logs, access logs and more. So, how has logging changed today? Big Data is not all that different from logging; in fact, I would consider Big Data logs to be logs with context. Context allows you to do interesting things with the data: now we can correlate user activity with what's happening in the system.

A Different Type of Log

So, what are logs? Logs are records of events, frequently created by applications with very little user interaction. It goes without saying that many logs are transaction logs or error logs.

However, there is a difference between forensic and business logs. Big Data is normally associated with the events, actions and behaviors of users of the system. Examples include records of purchases, which are linked to a user profile and span time. We call these business logs. Data and business analysts would love to get hold of this data, run some machine learning algorithms, and predict the outcome of a certain decision to improve user experience.

Now back to the developer. How does Big Data help us? On our end, we can utilize forensic logs. Logs get more interesting and helpful when we can combine records from multiple sources. Imagine hooking in and correlating IIS logs, method logs and performance counters together.

Big Data for Monitoring and Forensics

I would like to advocate that Big Data can and should be leveraged by web service developers to:

Better understand the system and improve performance of critical paths

Investigate failure trends which might lead to errors or exacerbate current issues.

Trace the chain of calls (e.g. method names, server names, etc.) to determine where method calls originate

With so much data being logged for every single call, it is important that the logging system is able to hold and process huge volumes of data; Big Data has to be handled on a whole different scale. The screenshots below are charts from Kibana. Please refer here to find out how to set up data collection and dashboard display using this suite of open source tools.

Example Usage

Based on the decision as to what kind of monitoring is required, the relevant information (e.g. context, method latency, class/method names) should be included in Big Data logs.

Detecting Problematic Dependencies

Plotting time spent in classes of incoming and outgoing components gives us visibility into the proportion of time spent in each layer of the service. The plot below revealed that the service was spending more and more time in a particular component, warranting an investigation.

Discovering Faulty Queries

Logging all exceptions, together with the appropriate error messages and details, allows developers to determine the circumstances under which a method fails. The plot below shows that MySQL exceptions started occurring at 17:30. Because the team included parameters within the logs, we were able to determine that invalid queries were used (typos and syntax errors).

Determine Traffic Pattern

Tapping into the IP addresses of incoming requests reveals very interesting traffic patterns. In the example below, the graph indicates a spike in traffic. On closer inspection, however, the graph shows that the spike spanned ALL countries. This suggests the spike was not due to user behavior, which led us to investigate other possible causes (e.g. DoS attacks, simultaneous updates of mobile apps, errors in logs, etc.). In this case, we found it was a false positive: repeated reads by log forwarders within the logging infrastructure.

Determine Faulty Dependents (as opposed to dependencies)

Big Data log generation can be enhanced to include IDs that track the chain of service calls from clients through the various services in the system. The first column below indicates that traffic from the iOS mobile app passes through the external API gateway before reaching our service. Other columns indicate different flows, giving developers enough information to detect and isolate problems in different systems when needed.

Tracking Progression Through Various Services

Ancestry.com has implemented a Big Data framework across all services to support call tracking across different services. This helps developers (who are knowledgeable about the underlying architecture) debug whenever a scenario doesn't work as expected. The graph below depicts different methods being exercised across different services, where each color refers to a single scenario. Such data provides full visibility into the interaction among different services across the organization.
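One common way to support this kind of call tracking is to propagate a correlation ID on every outbound call, minting one only at the edge of the system. A sketch (the header name is illustrative, not our framework's):

```javascript
// Sketch: each service forwards the caller's correlation ID, minting a new
// one only when the chain starts. The header name is illustrative.
const CORRELATION_HEADER = 'x-correlation-id';

function outgoingHeaders(incomingHeaders) {
  const id = incomingHeaders[CORRELATION_HEADER] ||
    // No ID present: this service is the start of the chain.
    Date.now().toString(36) + '-' + Math.random().toString(36).slice(2, 8);
  return { [CORRELATION_HEADER]: id };
}
```

When every service both logs this ID and forwards it downstream, a single query over the logs reconstructs the full path of one scenario across the organization.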

Summary

Forensic logs can be harnessed with Big Data tools and frameworks to greatly improve the effectiveness of development teams. By combining various views (such as the examples above) into a single dashboard, we are able to provide developers with a health snapshot of the system at any time, in order to determine failures or improve architectural designs.

By leveraging Big Data for forensic logging, we as developers are able to determine faults and reproduce error messages without conventional debugging tools. We have full visibility into the various processes in the system (assuming we have sufficient logs). Gone are the days when we needed to instrument code on LIVE boxes because an issue only occurred in the LIVE environment.

All of this work is done independently of the business analysts and is, in fact, crucial to the team's agility in reacting quickly to issues and continuously improving the system.

Do your developers use Big Data as part of daily development and maintenance of web services? What would you add to increase visibility in the system and to reduce bug-detection time?

Migrating From TFS to Git-based Repositories (II)
Seng Lin Shee | August 8, 2014

Previously, I wrote about why Git-based repositories have become popular and why TFS users ought to migrate to Git.

In this article, I would like to take a stab at providing a quick guide for longtime TFS / Visual Studio users to quickly ramp up on the knowledge required to work on Git-based repositories. This article presents Git usage from the perspective of a TFS user. Of course, there are some Git-only concepts, but I will try my best to lower the learning curve for the reader.

I do not intend to thoroughly explore the basic Git concepts. There are very good tutorials out there with amazing visualizations (e.g. see the Git tutorial). However, this is more of a no-frills quick guide for no-messing-around people to quickly get something done in a Git-based world (am I hitting a nerve yet?).

Visual Studio has done a good job of abstracting the complex commands behind the scenes, though I would highly recommend going through the nitty-gritty details of each Git command if you are invested in using Git for the long term.

For this tutorial, I only require that you have one of the following installed:

Remapping your TFS-trained brain to a Git-based one

TFS Terminology

You will start your work on a TFS solution by synchronizing the repository to your local folder.

Every time you modify a file, you will check out that file.

Checking in a file commits the change to the central repository; hence, it requires all contributors working on that file to ensure that conflicts have been resolved.

The one thing to note is that TFS keeps track of the entire content of files, rather than the changes made to the contents.

Additionally, versioning and branching require developers to obtain special rights to the TFS project repository.

Git Terminology

If you are part of the core contributor group (left part of diagram):

Git, on the other hand, introduces the concept of a local repository. Each local repository represents a standalone repository that allows the contributor to continue working, even when there is no network connection.

The contributor is able to commit work to the local repository and create branches based on the last snapshot taken from the remote repository.

When the contributor is ready to push changes to the main repository, a sync is performed (a pull, followed by a push). If conflicts occur, a fetch and a merge are performed, which require the contributor to resolve the conflicts.

Following conflict resolution, a commit is performed against the local repository and then finally a sync back to the remote repository.

The image above excludes the branching concept. You can read more about it here.

If you are an interested third party who wants to contribute (right part of diagram):

The selling point of Git is the ability for external users (who have read-only access) to contribute (with control from registered contributors).

Anyone who has read-only access is able to set up a Personal Project within the Git service and fork the repository.

Within this project, the external contributor has full access to modify any files. This Personal Project also has a remote repository and local repository component. Once ready, the helpful contributor may launch a pull request against the contributors of the main Team Project (see above).

With Git, unregistered contributors are able to get involved and contribute to the main Team Project without risking breaking the main project.

There can be as many personal projects forking from any repositories as needed.

*It should be noted that any project (be it a Personal or a Team Project) can step up to be the main working project in the event that the other projects disappear or lose popularity. Welcome to the wild world of open source development.

Guide to Your Very First Git Project

Outlined below are the steps you will take to make your very first check-in to a Git repository. This walkthrough assumes you are new to Git but have been using Visual Studio + TFS for a period of time.

Start from the very top and make your way to the bottom, trying out different approaches based on your situation and scenario. These approaches are the fork and pull (blue) and the shared repository (green) models. I intentionally include the feature branching model (yellow), which I will not elaborate on in this article, to show the similarities. You can read about these collaborative development models here.

Feel free to click on any particular step to learn more about it in detail.

Migrating from TFS to Git-based Repository

Create a new folder for your repository; TFS and Git temp files do not play nicely with each other. The following initializes the folder with the files relevant to Visual Studio projects and solutions (namely the .gitignore and .gitattributes files).

Copy your TFS project over to this folder.

I would advise running the following command to remove the READ ONLY flag from all files in the folder (this flag is automatically set by TFS when files are not checked out):

attrib -R /S /D *
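The initialization command referred to above did not survive in this copy of the article, so the following is only a sketch of what such an initialization could look like from the command line, with a hypothetical folder name and a deliberately abbreviated ignore list (the `attrib` step from the article is Windows-only, so its POSIX counterpart is noted in a comment):

```shell
set -e
repo=$(mktemp -d)/MyTfsProject    # hypothetical folder for the migrated project
mkdir -p "$repo"
cd "$repo"

git init -q   # turn the folder into a local Git repository

# Ignore typical Visual Studio build output and per-user settings (abbreviated).
printf '%s\n' 'bin/' 'obj/' '*.suo' '*.user' > .gitignore
# Have Git normalize line endings for text files.
printf '%s\n' '* text=auto' > .gitattributes

# (Windows) clear the TFS read-only flags:  attrib -R /S /D *
# A POSIX equivalent after copying the files in: chmod -R u+w .
```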

Click on Changes.

You will notice the generated files (.gitattributes and .gitignore). For now, you want to add all of the files you have just added. Click Add All under the Untracked Files drop-down menu.

Then, click Unsynced Commits.

Enter the URL of the newly created remote repository. This URL is obtained from the Git service when creating a new project repository.

You will be greeted with the following prompt:


Clone Repository onto Your Local Machine

Within Visual Studio, in the Team Explorer bar,

Click the Connect to Team Projects.

Clone a repository by entering the URL of the remote repository.

Provide a path for your local repository files.

You will then see Visual Studio solution files listed in the view. If you do not have solution files, then unfortunately, you will have to rely on tools such as the Git command line or other visualization tools.
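Outside Visual Studio, the same clone step is a single command. A minimal sketch, using a throwaway local bare repository in place of the remote URL (all paths are placeholders):

```shell
set -e
d=$(mktemp -d)

# Stand-in for the remote repository URL handed out by your Git service.
git init -q --bare -b main "$d/team-project.git"

# Clone the remote repository into a local path of your choosing.
git clone -q "$d/team-project.git" "$d/team-project"
```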

Pulling from the Remote Repository and Conflict Resolution

If no other contributors have added a change that conflicts with your change, you are good to go. Otherwise, the following message will appear:

Click on the Resolve the Conflict link to bring up the Resolve Conflict page. This is similar to conflict resolution in TFS. Click on Merge to bring up the Merge Window.

Once you are done merging, hit the Accept Merge button.

Merging creates new changes on top of your existing changes so that they match the base change in the remote repository. Click Commit Merge, followed by Commit, to commit the result to your local repository.

Now, you can finally click Sync.

If you see the following message, you have completed the “check-in” to your remote repository.

You are probably wondering what the difference is between branching and forking. Here is a good answer to that question. One simple answer is that you have to be a registered collaborator in order to make a branch or pull/push an existing branch.

Each Git service has its own way of creating a fork. The feature will be available when you have selected the right repository project and a branch to fork. Here are the references for GitHub and Stash respectively.

Once you have forked a repository, you will have your own personal URL for the newly created/cloned repository.

Most Git services will be able to trigger a pull request within the branch view of the repository. Please read these sites for specific instructions for BitBucket, GitHub and Stash.

A pull request can only be approved if there are no conflicts with the targeted branch. Otherwise, the repository will provide specific instructions to merge changes from the main repository back to your working repository.

Summary

Git is a newer approach to version control and has been very successful with open source projects, as well as with development teams that adopt the open source methodology. There are benefits to both Git and TFS repositories. Some projects are suitable candidates for adopting Git, whereas others are not; the deciding factors include team size, team dynamics, project cadence and requirements.

What are your thoughts about when Git should be the preferred version control for a project? What is the best approach for lowering the learning curve for long-term TFS users? How was your (or your team’s) experience in migrating towards full Git adoption? Did it work out? What Git tools do you use to improve Git-related tasks? Please share your experience in the comment section below.

Controlling Costs in a Cloudy Environment
By Daniel Sands on Tue, 24 Jun 2014
http://blogs.ancestry.com/techroots/controlling-costs-in-a-cloudy-environment/

From an engineering and development standpoint, one of the most important aspects of cloud infrastructure is the concept of unlimited resources. The idea of being able to get a new server to experiment with, or being able to spin up more servers on the fly to handle a traffic spike, is a foundational benefit of cloud architectures. This is handled in a variety of different ways by various cloud providers, but there is one thing that they all share in common:

Capacity costs money. The more capacity you use, the more it costs.

So how do we provide unlimited resources to our development and operations groups without it costing us an arm and a leg? The answer is remarkably simple: visibility is the key to controlling costs on cloud platforms. Team leads and managers with visibility into how much their cloud-based resources are costing them can make intelligent decisions with regard to their own budgets. Without decent visibility into the costs involved in a project, overruns are inevitable.

This kind of cost tracking and analysis has been the bane of accounting groups for years, but several projects have cropped up to tackle the problem. Projects like Netflix ICE provide open source tools to track costs in public cloud environments. Private cloud architectures are starting to catch up to public clouds with projects like Ceilometer in OpenStack, but it can be trickier to determine accurate costs for them due to the variables involved in a custom internal architecture.

The most important thing in managing costs of any nature is to realistically know what the costs are. Without this vital information, effectively managing the costs associated with infrastructure overhead can be nearly impossible.

Dealing with Your Team's Bell Curve
By Daniel Sands on Fri, 06 Jun 2014
http://blogs.ancestry.com/techroots/dealing-with-your-teams-bell-curve/

I recently came across this article on the INTUIT QuickBase blog and was intrigued by its premise. It asserts that inside any team or organization you will have a bell curve of talent and intelligence, which most would agree with. It's not a bad thing; it just happens. Regardless of how well staffed you are or how many experts you recruit, there will always be someone who stands out above the rest and someone who lags behind. Lagging behind is, in this case, a very relative matter, and the so-called lagging individual may in fact be generating brilliant work. This curve seems to exist naturally.

While the article discusses how groups respond to the least of the group, my interest was instead piqued by another thought: how do we each perceive ourselves within the group? From where I am standing, where do I think I am on the bell curve? On my own team, I know of individuals who depreciate their own perceived value, verbally expressing that others contribute more, have a better response time, or whatever criteria you wish to judge on. That perspective can actually be quite dangerous, as someone of great value may view themselves as insufficient. On the other hand, someone who views themselves as a rock star may be all flash and no substance.

More than anything, the concept triggered an awareness of my own team and helped me think a little more about those around me and be more sensitive to issues and circumstances that I may not have otherwise thought about. All in all, a good read if you have a few minutes.

I'll echo the author's question at the end of her article: how has the bell curve on your team affected business culture and team efficacy?

Find A Grave Engineering
By Robert Schultz on Wed, 21 May 2014
http://blogs.ancestry.com/techroots/find-a-grave-engineering/

Last October Ancestry.com acquired a very exciting property called Find A Grave, which is focused on collecting content around the graves of family, loved ones and famous people. With the acquisition we wanted to take Find A Grave to the next level and provide the current users new and better experiences around consuming and contributing content. Find A Grave has been around for over 15 years and has done a tremendous job of organically growing the website and user base.

Around the time of the acquisition I was asked to lead the engineering efforts, manage a successful transition and focus on new experiences. But this was a challenge: at the time, Find A Grave served millions of page views a day, was running on a different web technology stack, was sitting on a set of physical rack-mount servers in a separate data center, and contained over 100 million photos and over 116 million memorials.

Our objective was to transition all hardware and code to a modernized infrastructure within a short amount of time to provide the Find A Grave users with a better performing experience (quick page load times) and support future scale and growth over the years.

We also knew we needed to build a mobile app; it just makes sense. One of the primary goals of users of the site is to go out into cemeteries and take photos of grave sites. Before the Find A Grave app, this was an arduous and quite disconnected workflow: take photos at a grave site, come back home, download all of the photos to a computer, and then upload each photo to the site one by one. As you can see, if you have 100 photos to process, this can be quite a pain!

So our goal was to make mobile front and center for Find A Grave, and we started with iOS.

Building the Team

The first step was to build the team and get moving, and I quickly hired two amazingly talented iOS developers to focus on the iOS app, John Mead and Shengzhe Chen. John came from the freelance world working on many different iOS applications for different companies and startups. And Shengzhe was doing mobile payment development before joining Ancestry.com.

We then hired a backend engineer, Prasanna Ramanujam, who brings some really awesome skills to the team. He is currently positioning Find A Grave to be a very API-driven application, which allows us to work with many external clients from the ground up. Prasanna comes from NodePrime, where he was an engineer on their Node.js-related datacenter infrastructure, and previously from VMware, where he did Node.js evangelism.

Next we hired another full stack engineering hero, Bob Dowling, to first focus on building a new, modern and flexible frontend for the application. We want to provide a snappier, more performant frontend for the website than ever before, which includes optimizations throughout the whole stack. Bob comes from the freelance world as well, focusing on both mobile and web applications.

And just recently we have an awesome new intern helping out, Shruti Joshi, who will be working with us to get the new website built.

And of course there are our product owner, Mike Lawless; our designer, Jonathan Rumella; and the founder of Find A Grave, Jim Tipton, who have all been a tremendous help with all aspects of the team.

Infrastructure

Within a few months we were able to migrate all infrastructure, photos, database and content to the Ancestry data center. It was a very complex task as we needed to keep the existing site running live during this whole process. We also wanted to modernize the infrastructure and virtualize everything in our private virtual cloud. Part of this was using Go for continuous integration and Chef for deployment and provisioning to our virtual instances. This will allow us to scale to future growth beyond the acquisition.

iOS

As a team we worked closely with our product manager, designer, and the Find A Grave founder to work through what we thought would be an amazing 1.0 iOS application. We were able to work very closely using some extremely lean principles to build quickly and efficiently. From conception to release, the two-person team was able to build a 1.0 release containing over 10 major features, getting a great mobile experience into our users' hands in about four months. We released the first version to the public at the beginning of March 2014, and we're excited that millions of people will be able to use it over Memorial Day weekend to remember their loved ones. You can check out the app here.

Backend

Find A Grave grew as a Perl-based web application over the past 15 years, so we are rebuilding things from the ground up using Node.js. It works perfectly for our JSON API layer, provides great real-time support for our iOS app and works great with some of the new NoSQL infrastructure we're putting in.

Website

We also know that the Find A Grave website needs a fresh facelift. Utilizing our UX team, we are beginning to envision what that new look and feel will be, and trust us, it's going to be amazing. This is where we are going to harness the power of a very client-driven web application that will heavily utilize the new backend API services, built with Backbone.js and the Handlebars templating engine.

Accomplishments

To recap, we've built an amazing Find A Grave engineering team. We migrated all infrastructure in-house and virtualized everything, backed by proper continuous integration and delivery platforms. We've released an iOS app with some more great features coming out soon. We're rebuilding the backend from the ground up. Next on our list is giving the website a facelift to provide a better user experience. Whew, that's a lot to get started on in just six months.

Migrating From TFS to Git-based Repositories (Part I)
By Seng Lin Shee on Tue, 29 Apr 2014
http://blogs.ancestry.com/techroots/migrating-from-tfs-to-git-based-repositories-part-i/

Git, a distributed revision control and source code management system, has been making waves for years, and many software houses have been slowly adopting it not only as their source code repository, but also as the way their software development projects are managed.

There is much debate about using either a centralized or a distributed revision control system, so I am not going to delve into promoting one system over the other. What I hope to do is shed some light on the rationale and concepts of Git for long-time Team Foundation Server (TFS) users. I would also like to provide a mini tutorial on how to quickly get started on your recently migrated Git projects. The following blog article reflects my opinion and experience with Git and does not necessarily reflect the position of the company.

Today's blog article is a summary of what I have learned about Git and the rationale for migrating away from TFS, targeted at long-time TFS users.

What is Git?

Git is a distributed revision control and source code management (SCM) system. Note that the keyword here is distributed. The main difference from centralized revision control systems like TFS is that there need not be one main repository. Anyone can fork a branch (assuming it's public) into his/her personal repository, and anyone can participate in improving anybody's branch. Git allows local (on your machine) repositories to be disconnected from the network completely.

If you would like to explore the different repository installations, check out the MSDN blog discussion here and a heated discussion on Stack Overflow here. My two cents: TFS is still viable for projects that require tight integration with Visual Studio and Microsoft products. In these scenarios, TFS functions as more than just a source code repository system.

I find that the following talk by Linus Torvalds explains why Git operates the way it does, and why it is architected the way it is.

Not bothered to watch the whole thing? Summary: Git is architected for an open (as in open source) and distributed (a network of developers) development process.

How Git improves on TFS

I'm approaching Git as a long-time TFS user. The following paragraphs assume that you have sufficient experience with TFS.

Feature work branching

Have you ever found it frustrating to work on multiple features at the same time, juggling to keep each shelveset as minimal as possible in order to make code reviews easier? Have you ever copied all of the files in the repository to a different folder just to make sure the baseline repository was not buggy in the first place? Do you squirm when another team member checks in before you do, leaving you to spend a chunk of your time syncing their change before you can check in?

Git provides an easy way to work on different work items separately by creating branches within your repository. Git doesn’t store data as a series of changesets or deltas, but instead as a series of snapshots. When merging, you do not need to make sure you have all the previous changes in the targeted branch.

All of this can be done without duplicating your project folder and without connecting to the network.
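A rough command-line sketch of the branching workflow described above (project and branch names are made up for illustration): each work item gets its own branch, switching between them needs no folder copies or network, and a finished feature merges back cheaply.

```shell
set -e
p=$(mktemp -d)/proj
mkdir -p "$p"
cd "$p"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"
echo "base" > app.txt
git add app.txt
git commit -q -m "Baseline"
git branch -M main

# One lightweight branch per work item; no folder copies, no network.
git checkout -q -b feature/login
echo "login" >> app.txt
git commit -q -am "Login work"

git checkout -q main
git checkout -q -b feature/search
echo "search" > search.txt
git add search.txt
git commit -q -m "Search work"

# Merge a finished feature back into main; snapshots make this cheap.
git checkout -q main
git merge -q --no-edit feature/login
```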

Offline work

Have you ever been forced to wait on a slow network connection in order to check out files? What if the network connection goes down? What if you are working away from the office and do not have a stable internet connection?

Within a standard workflow, you always work off a local copy of the repository (known as the local branch). Apart from that, there is the concept of a remote branch, which represents the state of the branches in your remote repositories. Your changes are not made available to everyone until you 'push' them back to the remote repository. So, yes: Git still uses a central repository as a communication hub for all developers.

A remote branch represents a snapshot in time of the remote repository on the remote server. With a remote branch fetched locally, you can always fork various copies of that branch to begin different feature work, without any connection to the repository server.

Open Collaboration

TFS is tightly integrated with LDAP/Windows Authentication and uses it to provide secured access to the source code. However, this hinders widespread participation from potential coders because of the access requirements for checking out files and submitting changes. There is no easy way to fully utilize the capabilities of the source code repository (code review / change tracking) unless you have a certain level of permissions on the project repository.

Git encourages an open source development model, whereby multiple people can fork/duplicate projects easily. Even if the main branch is maintained by a single individual, any contributor can propose changes by forking a branch, making the modification and then submitting a 'pull request'. This essentially means "pulling in a change", or putting the changeset under consideration by the project administrator.

Web Based Code Review

Web interfaces such as GitHub and Stash have made it much easier to manage projects, pull requests (shelvesets for consideration) and even code reviews. These web interfaces provide a great channel for communication among developers, and they let observers (including managers and product owners) participate in the review process without using any development tools.

Backup

If the project has many developers, each cloned copy of the project on a developer's machine is effectively a backup. Git intrinsically saves the entire history of snapshots within your local repository; TFS, on the other hand, saves only the latest synced version of your files.

When not to consider Git

Git is a totally different beast with different concepts from TFS. It requires some retraining in order to fully utilize its capabilities.

If you have a small team and do not benefit from any of the features presented above, then there is no significant benefit to migrating away from TFS. In fact, you would lose productive hours, and even days, learning new syntax and workflows just for the sake of using a new technology.

At the end of the day, it depends on what workflow suits your development projects and whether the organization wants centralized control.

Summary

Git is a distributed revision control system that advocates collaboration at a wide scale and supports offline development work. Its killer feature is branching and merging of projects with ease and speed. This style of code management is ideal for parallel development work used by open source projects.

In Part II, I present a set of Git commands analogous to the way you work with TFS projects, in order to get you started right away with Git.

Ancestry.com to Present Jermline on DNA Day at the Global Big Data Conference
By Jeremy Pollack on Wed, 09 Apr 2014
http://blogs.ancestry.com/techroots/jeremy-pollack-to-present-jermline-at-the-big-data-innovation-summit-on-april-10th/

Interested in genealogy? Curious about DNA? Fascinated by the world of big data? If so, come check out my talk at the Global Big Data Conference on DNA Day this Friday, April 25, at 4pm PT in the Santa Clara Convention Center! I'll cover Jermline, our massively scalable DNA matching application. I'll talk about our business, give a run-through of the matching algorithm, and even throw in a few Game of Thrones jokes. It'll be fun! Hope to see you there.