While I was in Chicago on a business trip, my favourite activity was riding a rental bike around Lake Michigan. A bright autumn afternoon with few clouds and cooling air is definitely a perfect time to rent a bike and explore an unfamiliar city as a tourist. However, it’s quite frustrating when millions of tourists are renting bikes too. It’s unpleasant when there are no bikes left, or when too many riders share a tiny bike lane. A nice fall afternoon without many people on the road is just what you want when you’re a tourist in an exciting city you wanna explore.

The manager of a city bike sharing system, however, worries about very different things than a tourist does. As a manager, you wanna share as many bikes as possible with potential riders, and I am sure you will have concerns like these:

How many bikes are actually needed in the city bike sharing system?

How does the bike demand vary from day to day with the temperature, weather, holidays, and humidity?

It is most cost-efficient for the city not to provide too many bikes, but it’s also important not to run short. For a manager, serving residents well with an economical number of rental bikes is what matters most.

Here, I got the data from http://kaggle.com; there are about 11,000 records in the dataset. This dataset was provided by Hadi Fanaee Tork using data from Capital Bikeshare. Capital Bikeshare is a bike sharing system in Washington DC that rents bikes to people who are going to the Metro, going to work, or running errands. It has more than 3000 bikes at over 350 stations across Washington DC, Arlington and Alexandria, VA and Montgomery County, MD, and a bike can be returned to any station near your destination. I have not used the bikes in DC yet, but it might be worth a try: the first 30 minutes are free.

Some results from the data analysis

From the graph, we can tell that the most people use bikes during the fall, and the fewest people bike during the spring.

Through the year, bike demand starts to climb after April and declines after October. The demand peak is around September, at least in the two years of data records.

When I replot the data over 24 hours for working days, from midnight to 23:00, the pattern of bike demand shows that 1) as the temperature rises, more people are on bikes; 2) there are two peaks of bike demand in a working day, one in the morning around 8:00 and one in the evening around 18:00; 3) people like to use bikes during lunch time when the temperature is warmer than 20 degrees.

However, the bike demand looks a bit different on a holiday. The maximum bike demand is not as high as on working days, which suggests that residents in the DC area are the ones using the bikes on working days. The demand on a holiday is more spread out than on working days: it slowly starts after 8:00 when the temperature is pleasant, and the demand peak appears around 13:00 to 17:00.

From the above graphs, you might notice that we have only dug into the bike demand, which is labelled “count” in the dataset, together with (mainly) temperature. If we wanna predict how many bikes we actually need for each day, just imagine any condition in which you wouldn’t wanna ride a bike in DC. Speaking only for myself, I don’t wanna ride a bike when: 1) it’s too cold out there (oops, tropical people); 2) it’s too humid; 3) it’s too windy; 4) it’s raining hard; 5) too many people are out there riding bikes.

To make a prediction like a bikeshare system manager, we need to know the correlation between each pair of variables: humidity, weather, workingday, windspeed, hour, holiday and so on. Therefore, I produced a graph showing the correlation for each pair of variables. Blue represents positive correlation, and red means negative correlation. For example, looking at the column for ‘count’ (the bike demand I mentioned above), it has a positive correlation with temperature (‘temp’ in the graph), which means that when the temperature goes up people like to ride bikes, but it has a negative correlation with humidity (‘humidity’ in the graph), which indicates that people do not like to ride a bike when humidity is high. Note: this is a linear analysis, which means I just assume each pair of variables is linearly related, and that sometimes does not reflect reality. For example, I could not bike outside when it’s too hot, but the positive correlation suggests that people would love to bike even more when it’s really hot.
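To make the idea of a pairwise correlation table concrete, here is a minimal Python sketch. It is not the R code behind the graph: the numbers are made-up toy values (not Capital Bikeshare records), and the helper names `pearson` and `correlation_matrix` are my own for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation of every pair of columns, as a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values, only to show the sign of each relationship.
records = {
    "temp": [10, 15, 20, 25, 30],
    "humidity": [90, 80, 70, 60, 50],
    "count": [100, 180, 260, 330, 420],
}

matrix = correlation_matrix(records)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

In this toy table ‘count’ rises with ‘temp’ and falls with ‘humidity’, so the first pair comes out positive (blue in the graph) and the second negative (red).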

I ran a linear regression between bike demand and the variables above and fitted a regression model. It can help us predict how many bikes we actually need in the Capital Bikeshare system each day, according to the weather, holiday, temperature, etc. As a tourist, you could also decide whether you wanna go out today based on the weather forecast and a rough prediction of how many bikes are going to be out in the city.
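As a rough illustration of how such a regression turns conditions into a predicted demand, here is a minimal Python sketch of ordinary least squares with a single predictor. It is a simplification of the multi-variable model described here, the data points are hypothetical, and `fit_line` is an illustrative helper rather than the author's R code.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical hourly observations: temperature vs. bikes rented.
temps = [10, 15, 20, 25, 30]
counts = [120, 170, 220, 270, 320]

slope, intercept = fit_line(temps, counts)

# Predict demand for a new temperature using the fitted line.
new_temp = 28
predicted = slope * new_temp + intercept
print(f"count = {slope:.1f} * temp + {intercept:.1f}; predicted at {new_temp}: {predicted:.0f}")
```

A model with more variables (season, holiday, humidity, wind speed, hour) works the same way, with one coefficient per variable instead of a single slope.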

At this point we could make a prediction under some assumptions: it’s fall, and a holiday; the weather is clear with few clouds; the temperature is 30, but the air temperature is about 34; humidity is about 70%; wind speed is about 2; and it’s close to 16:00. So we could predict how many bikes are needed for that particular hour, day and weather.

The answer is 781 bikes.

My R code can be found here: http://rpubs.com/Geoyi/BikeshareDC_LM.

The Centers for Disease Control and Prevention (CDC) provided data on the Zika virus epidemic from 2015 to 2016, about 107,250 observed cases globally, to kaggle.com. Kaggle is a platform where data scientists compete on data cleaning, wrangling and analysis to provide the best solutions for big data problems.

The Zika virus epidemic is an interesting problem, so I took the challenge and coded an analysis in RStudio. However, after finishing a rough analysis, I found that this is better treated as an example of big data analysis than as a perfect example for the CDC on the Zika virus epidemic, because the raw data has not been cleaned and clarified yet. The raw data description can be seen here.

Zika is spread mostly by the bite of an infected Aedes species mosquito (Ae. aegypti and Ae. albopictus). These mosquitoes are aggressive daytime biters. They can also bite at night.

Zika can be passed from a pregnant woman to her fetus. Infection during pregnancy can cause certain birth defects, e.g. Microcephaly. Microcephaly is a rare nervous system disorder that causes a baby’s head to be small and not fully developed.

First, let’s look at the animation of the Zika virus observations globally. Cases were recorded from Nov. 2015 to July 2016. At least according to the documented cases during that period, it started in Mexico and El Salvador, and spread to South American countries and the USA. The gif animation makes the data visualization look fancy, but when I looked deeper, the dataset needed serious cleaning and wrangling.

When I plotted the cases by country from 2015 to 2016, we could see that many more Zika cases were observed in 2016, especially in South American countries. Colombia had by far the most reported Zika cases. Puerto Rico, New York, Florida and the Virgin Islands of the USA have reported Zika cases so far. During the recorded period, 12 countries reported Zika virus cases. From the most reported cases to the least, these countries are: Colombia (86,889 reported cases), Dominican Republic (5,716), Brazil (4,253), USA (2,962), Mexico (2,894), Argentina (2,091), El Salvador (1,000), Ecuador (796), Guatemala (516), Panama (148), Nicaragua (125) and Haiti (52). See the map below.

However, when I went back to organize the reported Zika cases for each country, I found that the data recorded for each country was not consistent. It’s obvious that each country has its own strengths and constraints in tracking the Zika epidemic. Let’s see some examples:

In the States, most of the reported cases are travel-related. But I am confused: don’t the confirmed fever, eye pain and headache cases overlap with the reported Zika cases, and weren’t the zika_reported travel cases included in yearly_reported_travel_cases? If so, the cases were overestimated for most of the countries. Probably only the CDC could explain the data much better from a medical and epidemiological perspective.

According to the reported cases, microcephaly caused by Zika virus was only found in Brazil and the Dominican Republic. The child’s brain stops growing as it should. People get infected with Zika through the bite of an infected Aedes species mosquito (Aedes aegypti and Aedes albopictus). A man with Zika can pass it to sex partners, but there was also a case in which a woman infected with Zika virus was found to have passed the virus to her partner.

Note: Again, this is an example of big data analysis rather than a perfect example for the CDC on the Zika virus epidemic, because the raw data from the CDC still needs serious cleaning. For more insight, please follow the CDC’s reports and recorded cases.

Project idea

Photovoltaic (PV) solar panels, which convert solar energy into electricity, are one of the most attractive options for homeowners. Studies have shown that by 2015 about 4.8 million homeowners had installed solar panels in the United States of America. Meanwhile, the solar energy market continues to grow rapidly. For homeowners, the estimated cost and potential savings of solar are the biggest questions. For solar energy companies, there is tremendous commercial potential, and visualizing the long-term trend of the market is vital for their survival. The visualization could be realized by examining the following aspects:

Who has installed PV panels, and what are the characteristics of the household, e.g. what’s the age, household income, education level, current utility rate, race, home location, current PV resource, existing incentive and tax credits for those that have installed PV panels?

What does the pattern of solar panel installation look like across the nation, and at what rate is it growing? Which households are the most likely to install solar panels in the future?

The expected primary output from this proposal is a web map application. It will contain two major functions. The first shows the cost and returned benefit for households according to their home geolocation. The second provides companies with interactive maps of the geolocations of their future customers and the growth trends.

Initial outputs

The cost and payback period for PV solar installation: Why not go solar!

Incentive programs and tax credits bring down the cost of solar panel installation. These are the average costs for each state.

Going solar would reduce homeowners’ spending on electricity bills.

Payback years vary from state to state, depending on incentives and costs. High cost does not necessarily mean a longer payback period because it also depends on the state’s current electricity rate and state subsidy/incentive schemes. The higher the current electricity rate, the sooner you would recoup the costs of solar panel installation. The higher the incentives from the state, the sooner you will recoup the installation cost.
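The relationship between cost, incentives, electricity rate and payback can be sketched with a simple payback calculation. This is a minimal Python sketch under my own assumptions: it ignores discounting, panel degradation and future rate changes, all figures are hypothetical, and `payback_years` is an illustrative helper rather than the method used for the maps.

```python
def payback_years(install_cost, incentives, annual_kwh, rate_per_kwh):
    """Simple payback: net installation cost divided by yearly bill savings."""
    annual_savings = annual_kwh * rate_per_kwh
    if annual_savings <= 0:
        raise ValueError("annual savings must be positive")
    net_cost = max(install_cost - incentives, 0)
    return net_cost / annual_savings


# Hypothetical household: 20,000 USD system, 6,000 USD incentives,
# 10,000 kWh of solar production per year.
base = payback_years(20_000, 6_000, 10_000, 0.14)
higher_rate = payback_years(20_000, 6_000, 10_000, 0.20)
more_incentives = payback_years(20_000, 10_000, 10_000, 0.14)

print(f"base case: {base:.1f} years")
print(f"higher electricity rate: {higher_rate:.1f} years")
print(f"larger incentives: {more_incentives:.1f} years")
```

Both the higher electricity rate and the larger incentive shorten the payback period, which matches the pattern described for the states.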

How many PV panels have been installed and where?

This is the number of solar panel installations in the States that have been registered on NREL’s Open PV Project. I was able to collect about 500,000 installations from the Open PV Project. It’s zip-code-based data, so I was able to merge it with the “zip code” package in R. My R code file is in my GitHub project.

Other statistical facts: American homeowners who installed solar panels generally have a household income $25,301.5 higher than the national household income. Their homes are located in places with higher electricity rates, about 4 cents/kW greater than the national average, and with a higher solar energy resource, about 1.42 kW/m2 higher than the national average.

Two interactive maps were produced in RStudio with “leaflet”

An overview of the solar panel installation in the United States.

Residents on the West Coast have installed about 32,000 solar panels according to the data registered on the Open PV Project, and most of them were installed by residents of California. When zoomed in closely, one can easily browse through the details of the installation locations around San Francisco.

Another good location would be the District of Columbia (Washington D.C.) area. The East Coast has less solar energy resource (kW/m2) than the West Coast, especially California. However, the number of solar panel installations by homeowners around the DC area is very high too. From the maps above, we know that this is because the cost of installation is much lower and the payback period is much shorter than in other parts of the country. It would be fascinating to dig out more information about the factors behind their motivation to install. We could also zoom in to much more detailed locations for each installation on this interactive map.

However, some areas, like DC and San Francisco, have a much larger population than other parts of the US, which means there are going to be many more installations. An installation rate per 10,000 people is a much more appropriate measure. Therefore, I produced another interactive map with the installation rate per 10,000 people: the bigger the circle, the higher the installation rate.
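The normalization itself is simple arithmetic. Here is a minimal Python sketch with hypothetical counts and populations (not Open PV Project figures); `installs_per_10k` is an illustrative helper name.

```python
def installs_per_10k(installations, population):
    """Installations per 10,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return installations / population * 10_000


# Hypothetical places: a big city with many installations and a small town with few.
big_city = installs_per_10k(installations=5_000, population=1_000_000)
small_town = installs_per_10k(installations=300, population=20_000)

print(f"big city: {big_city:.0f} per 10,000 people")
print(f"small town: {small_town:.0f} per 10,000 people")
```

In this made-up case the big city has far more installations in total, but the small town has the higher rate, which is why small places can stand out on the rate map.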

The largest installation rate in the country is in the city of Ladera Ranch, located in South Orange County, California. However, the reason behind it is not clear and more analysis is needed.

Buckland, MA has the highest installation rate on the East Coast. I can’t explain the motivation behind it yet either. Further analysis of the household characteristics would be helpful. These two interactive maps were uploaded to my GitHub repository, where you can also see the R code I wrote to process the data.

Note: I cannot guarantee the accuracy of the analysis. My results are based on two days of data mining, wrangling, and analysis. The quality of the analysis depends heavily on the quality of the data and on how well I understood the datasets in such limited time. Further validation of the analysis and datasets is needed.

I finally got my portfolio ready for my data science and GIS specialist job search. Many of my friends in data science have suggested that having a GitHub account would be helpful. GitHub is a site that hosts and manages code for programmers globally. GitHub works even better when your colleagues work on the same program with you: it helps track the code edits from other people’s contributions to the program/project.

I’ve started to host some of the code I developed in the past on my GitHub account. I use R and Python for data analysis and data visualization; Python for mapping and GIS work; and HTML, CSS and JavaScript for web application development. I’ve always been curious about how other people’s readme files look so much better than my own. BTW, a readme file helps other programmers read your files and code more easily. Some of my big data friends also shared this super helpful site that teaches you step by step how to use Git to link R and R Markdown in RStudio to GitHub. It’s very easy to understand.

Anyway, shoot me an email at geospatialanalystyi@gmail.com if you need any other instructions on it.

I believe all of us have watched the movie Titanic by James Cameron (1997) again and again, and after a good sob, let’s find out whether we all could have survived the Titanic. Actually, the Titanic dataset is also a superstar dataset in data science that people use for all sorts of crazy survival machine learning. Today we are going to use R to find out who actually survived, and what their age, sex, and social status were.

The sinking of the RMS Titanic occurred on the night of 14 April through to the morning of 15 April 1912 in the north Atlantic Ocean, four days into the ship’s maiden voyage from Southampton to New York City.

This graph shows who was on the Titanic: there were more male passengers than female, especially in third class.

This graph shows the survival comparison. The left graph shows people who did not survive and the right graph shows the survival counts (how many people survived). The death rate for third-class passengers was super high :-(. Female passengers had a high survival rate, especially in first class.

This is also a death and survival comparison, but with age added (y-axis). On the question of who the survivors were, you can see that females had the highest survival rate overall, but third-class females who survived the tragedy tended to be much younger. Now you know that Jack not surviving in the movie Titanic wasn’t just a tragedy in itself; he also faced a higher risk of losing his life in the sinking.

If you remember, in my last blog I presented an interactive map hosted via ESRI ArcGIS Online. After my data was successfully uploaded, I found several issues that I don’t like about it:

Even though ESRI ArcGIS Online has a super nice format that lets you visualize spatial data in a pretty way, data loading from the site is very slow, AND IT COULD BE VERY EXPENSIVE. I am on my 60-day free trial at this point, and I believe that if I wanna use the server and do some data analysis on ArcGIS Online I will have to buy their credits;

The way the data is presented is restricted to certain formats, depending on which web map format you select from ESRI.

I use quite a bit of R, and I know that two R packages, Shiny and Leaflet for R, might help me develop the idea. I was so thrilled to find these packages; I felt a bright light shining on my road and pointing to the destination I wanna head to. I also found a perfect example of what my web map application will look like, the American Super Zipcode case. It is not only an interactive map: as you zoom in and out, it also shows some statistical results on the right side of the map. It’s so cool.

But I was also disappointed when I found out that developing a web application through Shiny and Leaflet for R would not be free, because I still need a server to host my data and app once they are ready to be shared. At this point, however, I only need to test my ideas.

I gave up on the two methods above and even checked out Mapbox Studio and Cartodb, two of the most popular online interactive map and visualization platforms. They are aimed at developers (though you can still use them without a coding background), but I wanna have some features that require coding in JavaScript. The Leaflet JavaScript library is the last and best option I found: it gives me enough freedom to work out the functions/features for my application, and even the interactive analytical tools that I could put up there. Now I also find that D3 might be even more attractive, because it is a bigger JS library that covers not only interactive maps but also other kinds of online interactive data visualization.

I got a lot of help from browsing through some YouTube videos (that’s the reason I recorded a video myself, hoping it could be helpful to other struggling beginners like me). I learned quite a lot of new things, like GeoJson and GeoJson-vt. GeoJson is a JSON-based geodata format commonly used with JavaScript, which plays the role that shapefiles play for ArcGIS and QGIS. If your dataset is bigger than 1 MB, data loading on your website slows down, so the founder of the Leaflet JS library wrote a vector tile JS library (GeoJson-vt) to speed up the data loading process.
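For readers who have not seen GeoJSON before, here is a minimal Python sketch that builds a small FeatureCollection of points and serializes it to text a web map could load. The site names, coordinates and the `layer` property are hypothetical placeholders, and `to_feature` is an illustrative helper, not part of any library.

```python
import json


def to_feature(name, lon, lat, properties=None):
    """Build a GeoJSON Point Feature; GeoJSON orders coordinates as [longitude, latitude]."""
    props = {"name": name}
    props.update(properties or {})
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": props,
    }


# Hypothetical sample points, for illustration only.
features = [
    to_feature("Site A", 100.5, 13.75, {"layer": "plantation"}),
    to_feature("Site B", 102.6, 17.97, {"layer": "plantation"}),
]

collection = {"type": "FeatureCollection", "features": features}
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

The resulting text is plain JSON, which is why it is convenient for JavaScript mapping libraries, and also why large datasets become slow to load without something like vector tiles.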

I’ve been working on a web application for the Chinese Ministry of Commerce on rubber cultivation and its risks, which will be out soon, and I just wanna share with you a simplified version of the web map API here. It only has layers for now, though; more to come.

This web map API aims to tell investors that rubber cultivation is not just about clearing the land/forests, planting trees, and then waiting to tap the trees and sell the latex. There are many more risks in planting/cultivating rubber trees, including several natural disasters and cultural and economic conflicts between foreign investors and host countries.

We also found that the minimum price of rubber latex for livelihood sustainability is as high as 3USD/kg. I define the minimum price as the price at which an investor/household can cover the costs of establishing and managing their rubber plantations. When the actual rubber price is lower than the minimum price, there is no profit in having rubber plantations. The minimum price for running a rubber plantation varies from country to country. I ran the analysis for 8 countries in Asia: China, Laos, Myanmar, Cambodia, Vietnam, Malaysia and Indonesia. The minimum price depends on the minimum wage, labour availability, the costs of plantation establishment and management, and the average rubber latex productivity throughout the life span of the rubber trees. The cut-off price ranges from 1.2USD/kg to 3.6USD/kg.

As an example, if the rubber price in the market is now 2USD/kg, a country whose cut-off price for rubber is 3USD/kg won’t make any profit; instead, investors in that country might lose at least 1USD on every kg of rubber latex they sell.
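The arithmetic behind this example can be written as a minimal Python sketch. The market price and the cut-off values are the ones quoted in the text; the labels are placeholders rather than actual countries, and `margin_per_kg` is an illustrative helper.

```python
def margin_per_kg(market_price, cutoff_price):
    """Profit (positive) or loss (negative) in USD per kg of latex sold."""
    return market_price - cutoff_price


market_price = 2.0  # USD/kg, the example market price

# Cut-off prices in USD/kg; the labels are placeholders, not real countries.
cutoffs = {"lowest cut-off": 1.2, "example country": 3.0, "highest cut-off": 3.6}

margins = {label: margin_per_kg(market_price, cutoff) for label, cutoff in cutoffs.items()}
for label, margin in margins.items():
    status = "profit" if margin > 0 else "loss"
    print(f"{label}: {status} of {abs(margin):.2f} USD/kg")
```

At a market price of 2USD/kg, only producers near the low end of the cut-off range would still cover their costs.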

To be able to extract big data from Twitter, you have to register for the Twitter API.

I installed Python 3.5 and edited my Windows 8.1 environment variables from the advanced system settings. I downloaded Tweepy (for extracting data from Twitter using Python), but Tweepy could not be installed from my computer’s Command Prompt. It reminded me that I had to log in to my computer as the administrator to be able to install Tweepy. Of course, right?! Sometimes you just lose the battle by doing something not very smart. I logged back in to my computer as the administrator and the problem was solved.

Marco Bonzanini has written a full series of 7 blog posts about how to mine data from Twitter, if you are ever interested in doing big data analysis.

I had the opportunity to work for the Chinese Ministry of Commerce with ICRAF last fall, and have been studying the natural rubber value chain since then. I led four technical reports on the natural rubber value chain: the first report is on the Thailand natural rubber value chain (please see the title); the second is about the natural rubber value chain, foreign investments and land conflicts in Cambodia; the third is a comparison study between Thailand and Cambodia, the biggest natural rubber producer and an emerging rubber producer; the last report will concentrate on the risks of natural rubber cultivation and investment in Asia from a geospatial perspective. As I mentioned in the reports, there is no winner in the natural rubber value chain: we lose biodiversity and ecosystem services by converting natural forests to rubber monoculture (the upstream of the value chain); rubber processing emits millions of tons of polluted air and water, and carbon dioxide, back to nature (the midstream); and in the end there is no sustainable livelihood for the poor who grow rubber, and limited competitiveness in the end products market (the downstream). We should go back to the source and really think about how we can improve the whole value chain, and why.

The following content is the abstract of the Thailand report in English. These reports are currently in Chinese; if you are interested in the content please contact Dr. Zhuang-Fang Yi, geospatialanalystyi@gmail.com and yizhuangfang@mail.kib.ac.cn.

Figure 1. The Greater Mekong region and the global natural rubber producers.

Asia supplies 93% of global natural rubber demand. As the world’s No. 1 natural rubber producer, Thailand exports nearly 40% of global rubber demand, which is 87% of its domestic rubber production. The production improvement in Thailand depends not only on its biophysical suitability for growing rubber, but also on policy support and subsidies for millions of upstream rubber farmers. Thailand spent about 21.3 billion Baht (586 million USD) from Sep. 2013 to Mar. 2014 to subsidize its rubber farmers while the price of natural rubber went down. However, lacking manufacturing and financial support for the midstream and downstream of its natural rubber value chain, Thailand depends heavily on exporting rubber to other countries, e.g. China, the US, the EU and Japan.

The long history of natural rubber cultivation and support from the Thai government have helped Thai rubber farmers develop a more economically resilient cultivation system: rubber agroforestry. Rubber agroforestry is a rather complex intercropping system compared to rubber monoculture. Rubber monoculture refers to rubber plantations that only have rubber trees, where other plant species are constantly killed and removed using herbicide and manual clearance. Rubber agroforestry sustains better ecosystem services and also brings more economic returns. But the labour requirement and the knowledge gap between rubber monoculture and rubber agroforestry are the main constraints on a greener cultivation system. Rubber farmers only need to take intensive care of rubber trees in a rubber monoculture system, but need other knowledge and time inputs for rubber agroforestry. Even so, rubber farmers are practicing about 21 intercropping systems on more than 300 farms, without the kind of support from the authorities that rubber monoculture receives in Thailand. Urgent research and institutional support are needed for rubber agroforestry in Thailand and globally.

The emerging economies and natural rubber producer countries in the Mekong region, e.g. Vietnam, Cambodia, Laos, and Myanmar, are following in Thailand’s footsteps: they only practice rubber monoculture and strongly support the upstream of the value chain, but lack rubber manufacturing and supporting financing systems for the midstream and downstream. This leads to heavy dependence on rubber demand from China and the rest of the world, and to very weak economic resilience for millions of smallholder rubber farmers when the price goes down. In the Chinese market, the rubber price dropped from 6.3USD/kg to less than a dollar in 2014. China, as the biggest natural rubber importer, consumes nearly 40% of the global rubber supply. On the other hand, import taxes of 20% are charged, which has dramatically increased the cost of rubber end products and reduced their global competitiveness in the natural rubber market. There is no winner in the natural rubber value chain: we lose biodiversity and ecosystem services by converting natural forests to rubber monoculture (the upstream of the value chain); rubber processing emits millions of tons of polluted air and water, and carbon dioxide, back to nature (the midstream); and in the end there is no sustainable livelihood for the poor who grow rubber, and limited competitiveness in the end products market (the downstream). We should go back to the source and really think about how we can improve the whole value chain, and why.

More and more Chinese state-owned and private enterprises are following the Chinese central government’s “Go Global” strategy and have invested heavily outside of China. The natural rubber end products industry, especially the tire industry, is one of them. In this report, we scrutinized the natural rubber value chain in Thailand and its foreign investments, especially Chinese investments. We tried to answer:

Whether there are rubber cultivation systems that best combine economic returns with better support for ecosystem services;

Figure 2. Thailand, the biggest rubber producer, produces 4.5 million tons of natural rubber, and 80% of Thailand’s domestic natural rubber comes from Southern Thailand. Each polygon in the map represents a province, and the darker the color, the bigger the area of rubber cultivation.