
Affiliates Connect Project

This module provides an interface for easily integrating with the affiliate APIs or product advertising APIs offered by e-commerce platforms such as Flipkart, Amazon, and eBay, so you can fetch data from their stores and display it to monetize your website by advertising their products.

Affiliates Connect Demo

I have created a demo website where you can test the complete module and its submodules, including plugins. Everything is pre-configured, so you can try it out. Link to the website - affiliates-demo.ankitjain28.me

Welcome to my Google Summer of Code 2018 week 12 summary update of Project Affiliates Connect. Learn more about my previous update in Week 11 and in my archive. GSoC (Google Summer of Code) is a global program focused on bringing more student developers into open source software development. Students work with an open source organization on a 3-month programming project during their break from school.

Progress

Some of the tasks accomplished this week:

- I have added Live Browse of products through the Entity Browser, and I have extended the eBay plugin with search functionality so that it responds properly to this feature. Link to the patch - #2988942

- The image URL is fetched through the Scraper API as well as the native API, so it is stored as a link, and the product view page was showing it as a link rather than as an image. I have added a "Link to Image" field formatter plugin that displays the image at the size the user chooses on the "Manage Display" page of the Product entity. Link to the patch - #2990538

- I had added the cloaking URL, but it was displayed as a plain string rather than as the JavaScript-initiated link. I have added a "Cloak URL" field formatter for it, which displays the URL as a button/link on the page. Link to the patch - #2990538

- The product view page had the URL "/admin/structure/affiliates_connect/product/{affiliates_product}", which is changed to "/affiliates_connect/product/{affiliates_product}" so that the product view page can be exposed to menus, sidebars, blocks, and more without being part of the admin theme.

- I have fixed the bug that crashed the import of products through the Feeds module due to an issue in the Feeds Tamper plugin. Along with this, I have added some required tamper plugins to handle the null values that were causing the crash. Verification of the issue - #2983197, Link to the patch - #2984281

Week 12 - Goals

- I will work on displaying similar products from different e-commerce websites out of the site database. It is important that the product is precisely the same one the user is looking at, which is a difficult task, as most e-commerce sites use neither the same names nor the same attributes. I am using a machine learning approach for this.

- I will also work on adding the documentation for the project.

- I will also work on providing a distribution for Affiliates Connect.

Conclusion

Finding similar products is quite a difficult task, as I don't have much knowledge in the field of machine learning, but I will still try my best to complete it.
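A possible first step for this similarity goal, before any heavier machine learning, is a plain token-overlap (Jaccard) score over normalized product titles. This is only a sketch of the idea; the function names are illustrative and not part of the module:

```javascript
// Tokenize a product title into a set of lowercase alphanumeric words.
function tokenize(title) {
  return new Set(title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity: |intersection| / |union| of the two token sets.
// 1.0 means identical token sets, 0.0 means nothing in common.
function titleSimilarity(a, b) {
  const ta = tokenize(a);
  const tb = tokenize(b);
  let common = 0;
  for (const t of ta) if (tb.has(t)) common++;
  const union = ta.size + tb.size - common;
  return union === 0 ? 0 : common / union;
}
```

Two listings of the same phone from different vendors ("Apple iPhone X 64GB Space Grey" vs. "iPhone X (Space Grey, 64GB)") score about 0.83 here, while unrelated products score near 0, so a threshold can pre-filter candidate matches before a learned model is applied.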

Welcome to my Google Summer of Code 2018 week 11 summary update of Project Affiliates Connect. Learn more about my previous update in Week 10 and in my archive.

Progress

Some of the tasks accomplished this week:

- Products are stored in the custom content entity 'affiliates_product'. The Drupal Views module allows administrators and site designers to create, manage, and display lists of content, so I used Views to show the product data along with sorting/filtering, and it works perfectly; there is no need to configure anything programmatically. We can also display the products in blocks, and users can design the views as they like.

- A search feature for products is one of the important pieces of functionality a site admin requires. The Drupal Search API module provides powerful search functionality through indexing of the products in the database, and it can use Apache Solr to improve performance. I used the Search API database backend to implement the search, and it is working well. Link to a demo video - Video

- The Drupal contrib module entity_browser is a great utility that provides a generic entity browser/picker/selector. Our module comes with the 'Affiliates Product Entity Browser', which has two main features:

1. It lets users select/pick products through an entity_browser view. The Entity Browser uses a View to display the list of entity data as a table and exposes search, sort, and other filters.

2. The point above covers products that the user has already imported into the database. Now the question arises: what about products that are not yet in the site database? Without this feature, the user would first have to import the products before creating a node or any other content, which is bad. So here is our solution: along with the selection of products from the database, our module comes with a Live Search feature that lets the user search products directly from the external vendors using their native APIs. This way, a user can select/pick the appropriate product without going anywhere, and can also change the data of the selected product. Link to the demo video - Video

- The Live Search feature is also provided separately, to search products from the vendors and import them into the site database. Link to the patch - #2987416

Week 11 - Goals

- I am working on the product comparison feature. Product comparison is the side-by-side comparison of product data.

- I will work on improving the performance of the Node Scraper. I have exposed the API of the Node Scraper, which the Feeds module calls sequentially, resulting in low performance.

- I will also work on the cloaking feature. It is almost complete, but to make sure that crawlers and bots are not able to detect the redirected URL, it is necessary to identify whether a request was made by a valid user or by a bot.

Conclusion

Improving the performance of the Node Scraper is really difficult, as the Feeds module workflow is sequential; making the workflow parallel somehow would really help with the performance issue. It would be great if community members could help me out with better suggestions. I am looking for solutions to the above-stated problems.

Welcome to my Google Summer of Code 2018 week 10 summary update of Project Affiliates Connect. Learn more about my previous update in Week 9 and in my archive.

Progress

Some of the tasks accomplished this week:

- The Affiliates Connect eBay plugin is implemented and is under review. This plugin uses eBay's developer APIs to search products by two methods, findItemsByKeywords and findItemsByCategory. A user can search products by the two methods above and import the selected items (products) into his own Drupal site database, or import all the products up to 100 pages (a limitation of the eBay APIs). Link to the issue - #2985626

- eBay is an international e-commerce website similar to Amazon, so it has multiple websites and programs depending on the locale. We have taken care of that too: you can select the locale under the configuration/settings tab and search or import products from a particular locale. Link to the issue - #2985626

- Functional tests for the module have also been added, verifying the availability of the routes, the instantiation of the plugin, and the configuration form. Link to the issue - #2985702

- The cloaking feature is also implemented and is under review. Cloaking allows the user to hide the affiliate id/tag in the affiliate URL. People generally use cloaking so that visitors won't know it's an affiliate URL, or to protect themselves from commission theft.

- Cloaking falls under black-hat SEO, as it largely affects SEO by redirecting URLs to external sites. Crawlers and bots can find the redirect URL, so to prevent this, we have added some features that can be configured from the settings page.

- Most crawlers and bots respect the robots.txt file and follow it, so I have provided an option that allows a user to add the cloaked links to robots.txt.

- Along with that, adding rel="nofollow" tells the crawlers not to follow the URL or link.

- Some crawlers and bots do not follow the robots.txt file or the nofollow rule, so I have used JavaScript to implement cloaking. All the redirect links are encoded using base64, so crawlers won't recognize them as URLs or links. When the link is clicked, a JavaScript function is called that decodes the base64 string and calls the cloaking URL, from where the user is redirected to the original URL. Link to the issue - #2985796

- Implemented the hits analysis feature: a site admin can see the valid hits on each cloaking URL, which helps in analyzing which types of products get more visitors, so he can adjust the products accordingly. He can reset the hits too. Link to the issue - #2985796
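The click-time decoding described above can be sketched roughly as follows. This is an illustration of the mechanism, not the module's actual code; the function names and the data attribute are made up for the example:

```javascript
// Decode a base64-encoded cloaked link back into the real URL.
// (In the browser this would be atob(); Buffer is the Node.js equivalent.)
function decodeCloakedUrl(encoded) {
  return Buffer.from(encoded, 'base64').toString('utf8');
}

// Illustrative click handler: the crawler only ever sees the base64
// string in a data attribute, never an <a href> pointing off-site.
function onCloakedLinkClick(element) {
  const target = decodeCloakedUrl(element.dataset.cloakedUrl);
  window.location.assign(target); // browser-only; redirects the visitor
}
```

Because the page markup contains only an opaque base64 string, a crawler that does not execute JavaScript never discovers the redirect target.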

Week 10 - Goals

- Creating a configuration page for affiliates_connect_searchpage, which contains the configuration for using the search_api module for indexing and searching data in the Drupal database.

- Implementing the Affiliates Connect Browser widget to select products from the site database for advertising on the Drupal site.

Conclusion

Using the Search API is required for better performance when searching the products in the database, as the database will hold thousands of products. The browser widget will help the site admin select products for advertisement; in this way, he can manage products from different vendors according to the demand of his site visitors, guided by the hits analysis.

You can learn more about my last week's work in this blog post - Week 8. This blog post is for Week 9.

Progress

Some of the tasks accomplished this week:

- The default configuration for the Amazon scraper is completed and under review. The Scraper API uses the Feeds module along with the feeds_ex and feeds_advance_crawler modules, and it would be difficult for anyone to set up the scraping configuration by hand, so the Amazon plugin ships with a default configuration module that provides one feed type configured for scraping data from Amazon. The default configuration is provided as a submodule under the Affiliates Connect Amazon plugin. Uninstalling that module won't remove the configuration, and it is recommended to uninstall the default configuration module afterwards to improve performance. Link to the issue - #2983176

- The native API for the Affiliates Connect Flipkart plugin is completed and under review. The Flipkart plugin allows a user to import 500 products from each of 53 categories; a user can also import products from selected categories only. Product import is implemented through batch processing, so the request won't suffer timeout issues. Link to the issue - #2984461

- The native API for the Affiliates Connect Amazon plugin is also completed and under review. The Amazon plugin allows a user to search for products by ASIN (Amazon Standard Identification Number), which is unique for every product, and to import the product data into his site database as well.

- Along with search by ASIN, a user can also search products by keyword within categories, according to locale. Amazon is currently available in these countries (Brazil, Canada, China, Spain, France, Germany, India, Italy, Japan, Mexico, United Kingdom, United States), and the Product Advertising API allows importing products on the basis of locale. A user can also import all the search results into his site database by configuring the plugin settings page. The settings allow updating a product's metadata if the product already exists on the site, and creating a new product entry otherwise. Link to the patch - #2979372

- Some other issues were resolved as well, including major ones: restructuring the Product entity routes and form fields, and adding the PluginBase class and methods to the Product entity. Links to the patches - #2984462, #2984980, #2982833

Week 9 - Goals

- This week, I will work on the native API of eBay. The eBay Product Advertising API doesn't allow a user to import products the way Amazon's does, so I will implement a product search feature and import the search results into the site database.

- I will also work on the cloaking feature. Cloaking an affiliate URL hides the user's affiliate id. With cloaking, we provide a custom link in place of every valid link, which redirects the user to the valid URL by querying the database for the appropriate link. In this way, we can also count the number of valid clicks.

Conclusion

The eBay native API is similar to Amazon's, but I am finding eBay's documentation for the Product Advertising API hard to follow, so it's a little tough, though not impossible. Along with it, some tasks in the Amazon native API are still left; for instance, the import of search products is facing a request-throttling issue. Amazon sends only 10 products per request, and each search keyword has around 50,000+ products, which means 5,000 pages and therefore 5,000 requests, so the import suffers from request throttling. Even providing a gap of 5 seconds between two requests isn't helping. It would be a great help if someone has a better solution to the throttling issue.
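One possible mitigation, offered only as a suggestion rather than something the module implements, is to replace the fixed 5-second gap with exponential backoff, retrying a page only when the API signals throttling. `fetchPage` below is a placeholder for the real API call:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch `pages` pages sequentially, backing off exponentially whenever
// the vendor signals throttling. `fetchPage(n)` stands in for the real
// API request and should throw on a throttled response.
async function fetchAllPages(fetchPage, pages, baseDelayMs = 1000) {
  const results = [];
  for (let page = 1; page <= pages; page++) {
    let attempt = 0;
    for (;;) {
      try {
        results.push(await fetchPage(page));
        break;
      } catch (err) {
        attempt++;
        if (attempt > 5) throw err; // give up after 5 retries
        // 1s, 2s, 4s, 8s, 16s: the wait doubles on each consecutive failure
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
    await sleep(baseDelayMs); // polite gap between successful requests
  }
  return results;
}
```

Backoff does not raise the overall request budget, but it stops a burst of throttled responses from wasting the quota, which a fixed gap cannot do.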

You can learn more about my last week's work in this blog post - Week 7. This blog post is for Week 8.

Progress

Some of the tasks accomplished this week:

- The Scraper API is almost complete, with both static and dynamic fetchers. The static fetcher was completed last week, and the dynamic fetcher is completed this week. The dynamic fetcher uses Nightmare to load JavaScript and the content that JS loads. Both fetchers have crawling, inner fetching, and break fetching features. Link to the repo - advance_crawler

- The project is dockerized, and it can also be used on its own by other developers for scraping. Since Drupal interacts with the Node server through a REST API architecture, the scraper is not dependent on Drupal or its modules in any manner, and hence can be used in other projects. Link to the repo - advance_crawler

- I have added a multiple User-Agents feature to avoid blocking. I was working on proxies, but I didn't find any open-source/free proxy service, and third-party proxies can cause performance issues unless they are premium proxy services.

- I have also made the changes suggested by dbjpanda regarding the restructuring of routes; it is under review. As part of this issue, it was suggested to show the predefined fields along with the user-defined fields under the "Manage Fields" section, but it is not possible to show predefined fields there. You can see that this is done neither in Taxonomy nor in content types: the Basic Article content type does show a 'body' field under the "Manage Fields" tab, but only because it is not defined as a predefined field, i.e. inside the baseFieldDefinitions() method (see Drupal 8.6.x Node.php). It is actually defined as a field by the field.storage.node.body.yml file in the config/install directory. Link to the issue - #2982833
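The multiple User-Agents feature mentioned above can be sketched as a simple round-robin over a pool. The list and wiring here are illustrative assumptions, not the module's actual code:

```javascript
// A small pool of desktop browser User-Agent strings (illustrative).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/61.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/537.36 Chrome/67.0',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/66.0',
];

let next = 0;

// Round-robin: each outgoing request gets the next agent in the pool,
// so no single User-Agent dominates the request stream.
function nextUserAgent() {
  const agent = USER_AGENTS[next];
  next = (next + 1) % USER_AGENTS.length;
  return agent;
}

// Usage with the request lib might look like:
//   request({ url, headers: { 'User-Agent': nextUserAgent() } }, cb);
```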

Week 8 - Goals

- Provide a default configuration for the Amazon scraper. The Scraper API uses the Feeds module along with the feeds_ex and feeds_advance_crawler modules, and it would be difficult for anyone to set up the scraping configuration by hand, so the Amazon plugin will come with a default configuration plugin that provides one feed type configured for scraping data from Amazon. Link to the issue - #2983176

- I will work on the native API part of this module. The Amazon Product Advertising API doesn't allow data to be imported in bulk from Amazon; it only allows users to search for a product by ASIN (its unique product code) and returns the product information. The Flipkart Product Advertising API, on the other hand, allows a user to fetch 500 products from each category (Flipkart has around 58 categories).

Conclusion

Setting up the default configuration is a bit tough, as I have to go through the Amazon DOM structure carefully: Amazon uses multiple DOM structures for some pages, such as special pages for the launch of a new mobile. So it will be difficult to scrape 100% of the products from Amazon, but around 80% of them can be scraped this way. These are challenging tasks, but not impossible.

Week 6 of the GSoC coding period is completed successfully.

Project Abstract

I am working on "Developing a “Product Advertising API” module for Drupal 8" - #7. The “Product Advertising API” module, which has been renamed "Affiliates Connect", provides an interface to easily integrate with the affiliate APIs or product advertising APIs provided by e-commerce platforms like Flipkart, Amazon, and eBay, to fetch data from their databases and import it into Drupal to monetize your website by advertising their products. If an e-commerce platform doesn't provide affiliate APIs, we scrape the data from it.

Progress

Some of the tasks accomplished this week:

- The Scraper API uses the Feeds module to leverage its functionality and provide a generic scraper for content. By last week, I had already completed the fetcher for my module and was working on pagination. Our module needs to scrape all the relevant content related to a product, not just the content shown on the product overview page, but pagination (the Feeds crawler) can only work on the product overview page. To overcome this problem, I came up with two new concepts: "Inner Fetching/Scraping" and "Break Fetching".

With Inner Fetching, we scrape the product links from the product overview page, go to each product page, and scrape its inner content. Since we use a Node.js server for scraping, we send multiple requests in parallel and append each product page's HTML content to its respective overview page inside an <affiliatesconnect> tag. Sometimes the Feeds fetcher breaks when scraping a large number of inner links, so we give the user an option to configure the maximum number of product pages to fetch; this way, a user can fetch links in small batches (Break Fetching).

A small demo video explains this functionality better. Link to the video - Feeds Advance Crawler

- After discussion with the mentors, we decided to provide this fetcher as a separate module, as it depends entirely on the Feeds module and uses the Feeds Extensible Parsers module for parsing with the XPath HTML and QueryPath HTML parsers. This way, people can make better use of its feeds crawling (available in D7 but not yet in D8), inner fetching, and break fetching functionality. It is completed and has been reviewed by dbjpanda and thedrupalkid. Link to the module - Feeds Advance Crawler

- I have added functional tests for the instantiation of the modules and their configs. They have been reviewed by borisson_ and dbjpanda. Link to the issue - #2979801

- I have completed the Node.js scraper, with all these features, that powers our Feeds Advance Crawler. It is still under review by shibasisp. Link to the pull request - Node Scraper
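The inner fetching described above can be sketched as follows. `fetchHtml` stands in for the real request-based fetch, and the link extraction is deliberately simplified (the real crawler configures this per site); only the `<affiliatesconnect>` tag name comes from the text above:

```javascript
// Extract product links from an overview page. Simplified: a real
// implementation would use a proper HTML parser, not a regex.
function extractProductLinks(overviewHtml) {
  const links = [];
  const re = /href="([^"]*\/product\/[^"]*)"/g;
  let m;
  while ((m = re.exec(overviewHtml)) !== null) links.push(m[1]);
  return links;
}

// Inner fetching: grab every product page in parallel (capped at
// maxPages, the "break fetching" limit) and append each page's HTML
// to the overview inside an <affiliatesconnect> tag.
async function innerFetch(overviewHtml, fetchHtml, maxPages) {
  const links = extractProductLinks(overviewHtml).slice(0, maxPages);
  const pages = await Promise.all(links.map(fetchHtml));
  const inner = pages
    .map((html) => `<affiliatesconnect>${html}</affiliatesconnect>`)
    .join('');
  return overviewHtml + inner;
}
```

The cap is what keeps the Feeds fetcher from breaking on pages with hundreds of inner links: the user fetches in small batches instead of all at once.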

Week 7 - Goals

- The Scraper API uses a Node.js server for scraping, and I have provided the static fetcher that fetches data from static websites. I still need to provide the dynamic fetcher for dynamic websites, which use libraries such as Angular and React to load content. For the dynamic fetcher, I am using Nightmare; the reasons for choosing Nightmare are stated in this blog post - GSoC Week 5 blog post

- Since we have to fetch millions of products, it is necessary to implement multiple user agents and proxy rotation to avoid being blocked by the e-commerce websites.

- The Feeds Advance Crawler has a lot of configuration related to mapping fields to the correct sources, which will be difficult for users with no knowledge of QueryPath and XPath. So I am working on providing some preloaded default configuration that can help users create the same configuration for other e-commerce websites.

Conclusion

Setting up a default configuration for the user, and providing a copy of that configuration in a single click similar to Views, can be a bit tough and requires some research. My previous research on Node.js scraper libraries and their speed tests will be a great help in developing the dynamic scraper.

Week 5 of the GSoC coding period is completed successfully, and with this, the first evaluation is also completed successfully.


Progress

Some of the tasks accomplished this week:

- The native API for the affiliates_connect_flipkart module is integrated, but it needs to be reviewed. We decided to create separate modules for Flipkart and the other e-commerce platforms, similar to Social Auth. I have used the Batch API for fetching product data from the Flipkart Affiliate API so that we can fetch all the products without errors; otherwise the request would be terminated for exceeding PHP's max_execution_time. The code is on GitHub; link to the repo - affiliates_connect_flipkart

- The fetcher for the Scraper API is complete. I have created two fetchers (StaticFetcher and DynamicFetcher). Both leverage the power of the Feeds module and save the content of the URL in a temporary file, which is then passed to the parser for parsing. This is under review; link to the issue - #2979094

- The fetcher needs the Node.js server, as it uses the request library for fetching content from a URL and Nightmare for websites that use JS to load the DOM. The code is in this repo - Scraper

Week 6 - Goals

- This week, I will work on adding pagination to the fetcher so that we can crawl and scrape multiple pages of the same category/type. This way, we don't need to create feeds for every single page, which would be impossible for anyone to do. Link to the issue - #2980450

- I will also add functional tests to check whether the plugin (affiliates_connect_amazon) is instantiated properly after installation, and other tests for the configuration form. Link to the issue - #2979801

- I will try to complete the parser this week so that we can make good progress and complete the scraper part of the module.

Difficulties

- Paginating every page looks difficult, but the Feeds module uses the Batch API, and in this issue - #2979249 - the workflow is explained: "Feeds asks the fetcher if there is something to fetch. With the returned result from the fetcher, it asks the parser to parse it. This results in a number of items, and for each item, Feeds asks the processor to process it. After all items are processed, Feeds asks the parser again if it has more to parse. If so, this again results in a number of items, each of which is passed to the processor. If the parser has nothing more to parse, Feeds will ask the fetcher again if it has more to fetch." So it's not going to be difficult.

- Writing the parser that parses the result of the fetcher and passes it to the processor will be challenging.

Week 4 of the GSoC coding period is completed successfully, and with its completion, the first evaluation period has also started.


Progress

Some of the tasks accomplished before the first evaluation period; this post covers tasks completed from the first week of GSoC up to the first evaluation:

- Along with the skeleton issue, the overview page, which shows the different plugins enabled by the user and allows users to configure them, is also completed.

- The content entity for storing product data from various vendors is also completed and has been reviewed by borisson_, dbjpanda, and other mentors. It is still under discussion whether to use nodes or content entities to store the data. I am planning to use a content entity for storing product data, and my reasons are explained in this blog post - Where to store product's data? Link to the issue - #2975642

- Functional tests verifying the routes defined in the project, as suggested by borisson_, are also completed and reviewed. This includes functional tests checking whether product data is submitted correctly by the affiliates_product add form, as well as tests for deleting and editing products, all under this issue. Link to the issue - #2977377

- The native APIs provided by e-commerce sites only allow some percentage of the product data to be accessed. I am also working on the Flipkart module in this repo, based on my previous studies in this repo, and I will push my further work there. Since we can't fetch all the product data from any e-commerce site, we need to write a scraper that does this task for us. For scraping, I am using Node.js to create scraper APIs.

- To scrape millions of products from the sites, we need to design a scraper that gives us high performance in terms of data and time, so I tested various Node.js libraries: request, zombie, puppeteer, phantom, and nightmare. I found that zombie is the worst of all for websites containing heavy JavaScript. Puppeteer and Nightmare take about as much time as PhantomJS, but are less complex, faster, and more flexible to use. Request is the best of all, but it only works for static websites, unlike the others, which can also handle websites that use JavaScript to add content dynamically. Here are the speed benchmarks for the same.

Week 5 - Goals

- I have broken the Scraper API part down into smaller issues, so it will be easier to implement and review; this way, we can progress faster. I have added one issue to the issue queue. Link to the issue - #2979094

- As discussed with the mentors, we want something that eases the effort on the client side, so I am planning to use the Feeds module and its feed import functionality. Feeds has a config entity, "feeds type", that allows the user to choose the type of fetcher, parser, and processor, and to map the fields of the selected processor. Once a feeds type is created, we can use it for multiple URLs/categories (another entity, defined as a content entity, saves the URLs linked with feed types).

- Feeds provides an HTTPFetcher that uses Guzzle to collect the response from the URL entered by the user. As part of this module, I am developing two fetcher plugins, named StaticFetcher and DynamicFetcher. StaticFetcher collects the response from static websites using the request/x-ray libraries of Node.js, and DynamicFetcher collects the response from dynamic websites (websites that use JavaScript to load the DOM) using the Nightmare library of Node.js.

- Feeds provides a parser for parsing and mapping the data collected by the fetcher, so I am designing a separate parser that will use the cheerio library of Node.js for parsing the data.

- This week, I am going to create the fetcher part. In this way, we can provide the user with a generic scraper that can be configured to his requirements.
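In the planned parser, the field mapping will be done with cheerio's jQuery-like selectors. As a dependency-free illustration of the idea, turning fetched HTML into named fields for the processor, a crude string-matching version might look like this; the class and field names are examples only:

```javascript
// Pull a single element's text out of an HTML string by class name.
// A stand-in for cheerio's $('.selector').text(); real pages need a
// real HTML parser, not a regex.
function textByClass(html, className) {
  const re = new RegExp(`class="${className}"[^>]*>([^<]*)<`);
  const m = html.match(re);
  return m ? m[1].trim() : null;
}

// Map raw product-page HTML to the fields the Feeds processor expects.
function parseProduct(html) {
  return {
    name: textByClass(html, 'product-title'),
    price: textByClass(html, 'product-price'),
    currency: null, // left for a smarter parser
  };
}
```

With cheerio, the same mapping would be `$('.product-title').text()` and so on; the point is only that the parser's output is a flat object of named fields that Feeds can map onto the affiliates_product entity.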

While working on the "Affiliates Connect" module as a Google Summer of Code student under Drupal, we encountered a situation where we had to store product data in the Drupal database, and we were torn between nodes and custom content entities.


About Nodes and Content Entity

Most content on a Drupal website is stored and treated as "nodes". A node is any piece of individual content, such as a page, poll, article, forum topic, or a blog entry.

Entities, in Drupal, are objects used for the persistent storage of content and configuration information. Entity types are used to define custom types of data, content, or configuration for specific purposes. The default Content (Node) type is an example of an entity type.

Why Content Entities over Nodes?

There are many occasions when it makes sense to create your own entity type, by which you can have ultimate flexibility and control on every aspect of data, from display, to saving and editing, custom properties, and integration with other entities. Sometimes you want to store data that will primarily be used in calculations or for storage and is not designed to be the main content item of a web page. Creating your own Entity types allows you to manage the spectrum and synthesis of data vs. content in a website with flexibility and with an eye toward optimized performance.

The default content type (node) is an example of an entity, but comments and taxonomy terms are not defined as nodes; they are defined as their own entity types. So, saving product data as a custom entity can have various advantages:

- It gives us flexibility and control over the data.

- Too many contrib and core modules integrate with nodes by default, which can lead to unnecessary overhead; a custom entity avoids this.

- We can use Views, search_api_solr, and other modules with entities too.

- Nodes are designed for site-building content that uses the full power of Drupal, not as a general-purpose data store.

- Storing millions of products as entities can help us perform more operations later and add more features to our module.

- Most e-commerce modules in Drupal, like Drupal Commerce, use content entities to store product data.

After discussing and reading about both of them, I am planning to use content entities for storing product data. The custom Affiliates Product entity for storing product data from various vendors is completed and has been reviewed by borisson_, dbjpanda, and other mentors. Link to the issue - #2975642.

The issue is still open for healthy discussion if anyone has any questions regarding the same.