Bug Fixes

Highlights

We added support for XML sitemaps that are located in non-standard locations within a domain.

We added sort_by support to our Results API.

Chores

We finished migrating to CircleCI for our continuous integration monitoring.

We improved our internal tracking of queries to the Bing API.

We improved how we handle indexing domains that time out.

We began indexing the last-modified date of a page, if provided.

Our SitemapIndexer now processes one sitemap at a time, and we created an automated queue for indexing jobs and URL fetching.

We improved the management of Searchgov domain states. Now each Searchgov domain has an “indexing activity”. States might include: indexing sitemaps, fetching new URLs (such as after bulk import), and crawling.

We now follow client-side redirects.

We improved our ability to avoid certain crawler traps.

We now index documents up to 15 MB in size. The previous limit was 10 MB.

We finalized our compliance with BOD 18-01.

We cleaned up how we handle temp files during indexing.

We tidied up our internal errors on indexing jobs, as well as our test suite.

Bug Fixes

We fixed a bug that prevented diacritics from displaying properly in non-English searches.

Highlights

We continue to make good progress on our indexing system, and our work continues to be highly focused on the back end of our system. See below for more details.

Chores

We created back-end interfaces allowing the Search.gov team to manage indexed domains and URLs.

We added a delay method to SearchGov Domain, to honor the crawl delay settings in a given site’s robots.txt file.
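As a rough illustration of what honoring a crawl delay involves, the sketch below reads the Crawl-delay value from a robots.txt body using Python's standard urllib.robotparser. It is not the SearchGov Domain delay method itself; the function name crawl_delay_for and the fallback of 0 seconds are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_txt, user_agent="*", default=0.0):
    """Return the Crawl-delay (in seconds) that robots_txt sets for user_agent.

    Hypothetical helper: falls back to `default` when no delay is declared.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


robots = "User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n"
print(crawl_delay_for(robots))
```

A fetcher built on this would wait that many seconds between requests to the same domain.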

We created a SearchGov Domain Indexer job that enqueues URLs in need of fetching, so that bulk indexing tasks can be automated without overloading anyone’s servers. We also added support for resque-scheduler to our configuration baseline.

We set the sitemap indexer to reject URLs from other domains, to avoid erroneous attempts to index content from beta sites, old domains, etc.
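The cross-domain check on sitemap entries can be pictured with a minimal sketch like the one below. It assumes an exact, case-insensitive host match, which may differ from how our indexer treats subdomains; belongs_to_domain and filter_sitemap_urls are hypothetical names, and the example.gov hosts are placeholders.

```python
from urllib.parse import urlsplit


def belongs_to_domain(url, domain):
    """True when the URL's host is exactly `domain` (assumed matching rule)."""
    host = urlsplit(url).hostname or ""
    return host.lower() == domain.lower()


def filter_sitemap_urls(urls, domain):
    """Keep only the sitemap entries hosted on the domain being indexed."""
    return [url for url in urls if belongs_to_domain(url, domain)]


sitemap_urls = [
    "https://www.example.gov/passports",
    "https://beta.example.gov/passports",  # beta site: rejected
    "https://old-example.gov/passports",   # old domain: rejected
]
print(filter_sitemap_urls(sitemap_urls, "www.example.gov"))
```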

We now check the protocol of a domain, and whether the site is responding to us. We also set our URL fetcher to throw an error if the domain is unavailable or blocking our indexer.

We re-indexed the searchgov indices.

We upgraded MySQL in demo environments, and streamlined the scenario data for our test suite.

Bug Fixes

We fixed a bug that sent searchers back to page 1 results when changing the time scope in a Collection search.

We mitigated SSL certificate problems with some sites.

We made our redirection check stricter to avoid filling our database and indexes with domains and web pages that don’t need to be searchable.

Highlights

Our new indexing system has been in production since December! In January, we released several features that are building on and improving our new system:

We updated our index with new stemming settings. Stemming refers to how a system processes related words, based on the root of the words. For example, stemming is what allows a search for “renew passport” to show results for “passport renewal”.
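To make the idea of stemming concrete, here is a deliberately tiny sketch. It is a toy suffix stripper written for this example, not the stemmer or settings our index uses; the suffix list and the minimum root length of three letters are assumptions.

```python
# Toy suffix list, chosen only so the example words reduce to a shared root.
SUFFIXES = ("ing", "al", "s")


def stem(word):
    """Strip one known suffix, keeping at least three letters of the root."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Return the set of stemmed words in `text`."""
    return {stem(word) for word in text.split()}


# Both phrases reduce to the same roots, so a query for one can match the other.
print(stems("renew passport") == stems("passport renewal"))
```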

URLs in our index can be permanently deleted from our system.

Documents in our index are now limited to 10 MB in size.

We can extract body text from a document if the <main> element is empty.

Chores

Deprecations

The Instagram section is no longer displayed in the Admin Center dashboard, unless you had an Instagram account added to your search site prior to June 2016. At that time, Instagram began requiring accounts to grant permission to index their images via an integration between systems, which Search.gov cannot support. Therefore, our Instagram index was last updated in June 2016. Any images in our index prior to that date will continue to be shown on your search results page, as long as you do not remove your Instagram account from the Admin Center. If you remove your account, any photos in our index will be permanently deleted from our system.

Drilldown tables and graphs are no longer available in the Monthly Reports section. According to our analytics, the tables and graphs were not being viewed. We anticipate rolling out new analytics viewing options later in 2018.

Highlights

Our new indexing system is now in production! Our team has been hard at work on this effort since June, and we are thrilled to reach this exciting milestone. In December, we released several features that helped us cross the first phase finish line:

Our new system is live with an updated version of Elasticsearch.

Our new system takes into consideration a “Promote” value when determining relevancy. “Promote” is a true/false value and is optional.

Our technical lead improved the way the Loofah scraper gets HTML documents into our system. We work in the open as much as possible, and this minor change helped fix a large bug in the Loofah core code.

Chores

The endpoint for our Jobs API was updated on December 7th. This change puts the hostname under Search.gov’s DNS zone; previously, it was hosted in another part of our division. This code change only affected agencies that are directly calling our open source API. If you are only using our Jobs Module on your hosted search results page, you did not need to take any action.

We updated Rails on our Jobs API.

We updated Ruby on our main application.

We transitioned away from using UserVoice to collect feedback from our customers. Instead, you can submit feedback via Google form or by emailing us. Take a moment to review the feedback we’ve already received.

Highlights

To accomplish our FY 2018 goals, we continued backend development that will allow your agency content to be served directly from our indexes. In November, our team began testing our new system. We also released several features related to this project:

Collections results will now come from our indexes. Previously, a site that saw main page search results from our indexes would still see Bing or Google results when using Collections. Now, if a site is using our indexes for its main search page, its Collections results will also come from our indexes.

The page-1 RSS module and search page alert will now appear for sites getting results from our indexes. Previously, these two features only worked on sites using Bing/Google.

i14y documents will be rejected if the document_id contains slashes or is more than 512 bytes.
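If you submit documents through i14y, a pre-flight check along these lines can catch a document_id that would be rejected. This is a sketch under two assumptions: that "slashes" means the forward slash character, and that the 512-byte limit is measured on the UTF-8 encoding. The function name is hypothetical and not part of the i14y API.

```python
MAX_DOCUMENT_ID_BYTES = 512


def is_valid_document_id(document_id):
    """Check a document_id against the slash and length rules described above."""
    if "/" in document_id:
        return False
    # Measure bytes, not characters: non-ASCII characters take more than one byte.
    return len(document_id.encode("utf-8")) <= MAX_DOCUMENT_ID_BYTES


print(is_valid_document_id("press-release-2017-11-01"))
print(is_valid_document_id("press/release/2017"))
```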

Chores

On November 14th, we notified our users of a Bing service degradation. This caused inconsistent results across sites, including incomplete results or 503 errors, and prevented access to the Search Admin Center. The inconsistency began at 2:05pm ET and ended at 3:05pm ET.

We continued transitioning our repos to use CircleCI.

Fixes

Phrase searches now work with content that is served from our indexes. A search for “cheese curds” will return results for the specific phrase “cheese curds” rather than the separate words “cheese” and “curds”.
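The difference between matching a phrase and matching separate words can be shown with a small sketch. It uses plain whitespace tokenization and is only an illustration of the behavior, not how our indexes evaluate queries; all function names are made up for this example.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplification for this example)."""
    return text.lower().split()


def matches_all_terms(query, document):
    """True when every query word appears somewhere in the document."""
    words = set(tokenize(document))
    return all(term in words for term in tokenize(query))


def matches_phrase(query, document):
    """True when the query words appear next to each other, in order."""
    q, d = tokenize(query), tokenize(document)
    n = len(q)
    return n > 0 and any(d[i:i + n] == q for i in range(len(d) - n + 1))


adjacent = "Fresh cheese curds from Wisconsin"
scattered = "Curds form when cheese is made"
print(matches_phrase("cheese curds", adjacent))      # words are adjacent
print(matches_phrase("cheese curds", scattered))     # words are separated
print(matches_all_terms("cheese curds", scattered))  # both words still present
```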