9 Motivation
With international trade and commerce growing in importance, daily access to up-to-date currency rates matters not only to specific industries but to all of us. By making exchange-rate information more convenient to obtain, which our project aims to do, people will be better equipped to make decisions regarding the currency market.

10 Dataset, Algorithm, Tools
Dataset:
- Instant exchange rates from Bloomberg
- Historical exchange rates from
Algorithm: We will forecast the exchange rates of targeted developed-market currencies against the US Dollar, applying a scalable model that forecasts in real time. We will use RMSE to measure the reliability and accuracy of our predictions, and report statistical significance (e.g., p-values) to demonstrate that the outcome is repeatable.
Tools: Eclipse, Apache Tomcat
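As a minimal sketch of the evaluation step described above (the rate values are illustrative, not from the dataset), RMSE over a series of forecasts can be computed as:

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length rate series."""
    assert len(actual) == len(predicted)
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical EUR/USD forecasts vs. observed rates
observed = [1.083, 1.085, 1.081, 1.079]
forecast = [1.082, 1.087, 1.080, 1.081]
print(round(rmse(observed, forecast), 4))  # → 0.0016
```

A lower RMSE means the forecast tracks the observed rates more closely, in the same units as the rates themselves.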

11 Progress and Expected Contributions
Expected Contributions: Our project is designed to provide users with the following:
- Forward and cross exchange rates for most world currencies
- Both instant and historical data
- Basic analysis of exchange rates, including regression and K-means clustering
- Latest news about exchange rates
Current Progress: a sketch of the webpage design.
Key Factors Affecting Exchange Rates:
- PPP (Purchasing Power Parity)
- INT (Interest Rate Differential), such as Libor
- GDP (the difference in GDP growth rates)
- IGP (Income Growth Rate)
- Relative Economic Strength, which we may measure quantitatively using factors like GDP and IGP
E6893 Big Data Analytics Lecture 12: Final Project Proposal

13 Motivation
Improvements in internet speed and storage mean data is flooding in everywhere, and Hadoop is able to handle it: MapReduce is the tool for parallel computation over the data.
Algorithmic trading is the use of computer programs for entering trading orders, in which computer algorithms decide every aspect of the order, such as its timing, price, and quantity.
1. Back-test the algorithm on enough historical price data to validate and optimize it in terms of profitability, stability, etc.
2. For a complex algorithm, there are many parameters that need to be optimized.
Goals:
1. The system will build and select the most suitable strategy 20% faster than before.
2. The enhanced platform will double the number of strategy groups.
3. The strategies can be updated more frequently and can include more parameters in the analysis.

14 Dataset, Algorithm, Tools
The dataset we chose is the stock index futures of the Shanghai Futures Exchange (the big bull market of A shares), or others.
Algorithm: The algorithm is a combination of Moving Average Convergence Divergence (MACD) and the Relative Strength Index (RSI).
- MACD: parameterized by the long and short average periods
- RSI: parameterized by the upper threshold, the lower threshold, and the calculation period
The project is divided into two parts:
- First, the MapReduce platform: Hadoop
- Second, the trading strategies to test against the data on this platform
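As a hedged sketch of the two indicators (the default parameter values here are common conventions, not taken from the slides), MACD and RSI can be computed as:

```python
def ema(prices, period):
    """Exponential moving average with smoothing factor 2 / (period + 1)."""
    k = 2 / (period + 1)
    avg = prices[0]
    out = []
    for p in prices:
        avg = p * k + avg * (1 - k)
        out.append(avg)
    return out

def macd(prices, short=12, long=26):
    """MACD line: short-period EMA minus long-period EMA."""
    return [s - l for s, l in zip(ema(prices, short), ema(prices, long))]

def rsi(prices, period=14):
    """Relative Strength Index over the trailing `period` price changes."""
    deltas = [b - a for a, b in zip(prices, prices[1:])]
    gains = [max(d, 0) for d in deltas[-period:]]
    losses = [max(-d, 0) for d in deltas[-period:]]
    avg_gain, avg_loss = sum(gains) / period, sum(losses) / period
    if avg_loss == 0:
        return 100.0
    return 100 - 100 / (1 + avg_gain / avg_loss)
```

A strategy of the kind described might, for example, go long when the MACD line crosses above zero while RSI remains below the upper threshold; the thresholds and periods are exactly the parameters the MapReduce search below would tune.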

15 Progress and Expected Contributions
Inner MapReduce:
- Input: daily price data; each line contains 100 days of price information
- Output: the performance of the parameters on the data
Outer MapReduce:
- Input: parameter combinations; each line contains one combination of parameters
- Output: the best parameters
Contribution: Algorithmic trading has many uses in investment strategy, including market making, inter-market spreading, arbitrage, and pure speculation. With MapReduce it becomes faster, multi-tasking, and closer to real-time updates.
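The nested structure above can be sketched in plain Python (the strategy-scoring function is a hypothetical stand-in for the MACD/RSI back-test; on the actual platform the map and reduce stages would run as Hadoop jobs):

```python
from functools import reduce
from itertools import product

def score_strategy(params, prices):
    """Hypothetical stand-in for back-testing one parameter combination."""
    short, long = params
    return -abs(long - short - 10)  # pretend a 10-day spread is optimal

def mapper(params, prices):
    # Inner stage: evaluate one parameter combination on the price data.
    return (params, score_strategy(params, prices))

def reducer(best, candidate):
    # Outer stage: keep the best-scoring combination seen so far.
    return best if best[1] >= candidate[1] else candidate

prices = [100.0 + i for i in range(100)]       # dummy daily prices
grid = product(range(5, 15), range(20, 30))    # (short, long) combinations
best_params, best_score = reduce(reducer, (mapper(p, prices) for p in grid))
```

Because each parameter combination is scored independently, the mapper stage parallelizes naturally across the cluster, which is the point of the two-level design.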

17 Motivation
- With the advent of social media, the number of mobile pictures being taken and uploaded is increasing exponentially.
- Although most photos are uploaded with some basic metadata (date, time, camera model, and possibly geo-location), a great deal of detail is missing when they enter the cloud. Unless users go through and tag each image by hand, this can create a search nightmare.
- Example: How do you find that picture you took a few years back while on vacation in Paris? It was under a bridge by the river, right? Resorting to clicking through hundreds of photos, or waiting for images to cache on your phone, can take forever when you want to show someone in a pinch.
Challenge: To make image search more effective, it will be important to develop and apply advanced algorithms that help auto-tag images. Doing so can narrow down image search and improve the quality of search results.
- Leveraging the Yahoo Labs Flickr dataset, we plan to test and build upon feature-extraction methods, using a parallelized computing system to efficiently extract image characteristics.
- Using these characteristics, we will train and test image classifiers and evaluate them on precision. Beyond this step, we also plan to experiment with a GPU-powered processing system to evaluate the added benefits and performance benchmarks it might offer over a standard distributed system during the image-analysis stage.
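As a minimal sketch of the precision metric mentioned above (the tag names are hypothetical), per-image tagging precision is the fraction of predicted tags that are correct:

```python
def precision(predicted_tags, true_tags):
    """Fraction of predicted tags that are correct for one image."""
    predicted, truth = set(predicted_tags), set(true_tags)
    if not predicted:
        return 0.0
    return len(predicted & truth) / len(predicted)

# Hypothetical auto-tagger output for one photo: 3 of 4 predictions correct
print(precision({"bridge", "river", "paris", "dog"},
                {"bridge", "river", "paris", "night"}))  # → 0.75
```

Averaging this over a held-out test set gives the evaluation number the project would report for each classifier.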

19 Progress and Expected Contributions
Progress:
- Conducted thorough research on open-source image-classification packages, as well as the tools required to perform feature extraction and analysis
- Set up an environment for a Hadoop distributed system
- Acquired the Yahoo Flickr dataset and began initial testing
- Acquired hardware and performed initial tests on a GPU-based system
Expected Contributions:
- Our goal is to research, implement, and build upon the current open-source offerings for image classification to help improve the auto-tagging process for digital photos.
- Given the time and resources, we aim to develop a web-based interface that will allow users to upload an image and perform feature analysis and extraction to determine which tags and keywords are associated with it.
Potential Challenges:
- Research into feature extraction is very resource-intensive, so we have picked a challenging project for only two students.
- Conducting such a challenging project will involve many abstract mathematical models.
- Companies like Yahoo and Facebook invest millions in this field.
- We anticipate this to be an ongoing project that we can continue well beyond the course, perhaps into the second semester.

26 Motivation
Forex trading requires statistical insight into the exchange market.
- Large quantities of data, but visualization is only utilized at the day/week/month level
- Difficult to see and analyze real-time trends at a granularity finer than one day
- Need to collect, analyze, and visualize data streaming in real time
Solution: distributed computing, real-time computation, statistical analysis, data visualization

27 Datasets, Algorithms, Tools
Datasets: large amounts of intraday/daily forex/equity/other data
Algorithms:
- Recommenders: suggesting trading prices and items to exchange
- Clustering: to analyze trends over a variable period of time
- Classifying: to classify trends into upward/downward movements and momentum
Tools:
- Java and Mahout for the analytics
- JavaScript, Python, and R for data gathering, the web server, and visualization
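The classification step above could be sketched as follows (a hypothetical illustration, not the Mahout implementation: it labels a price window upward or downward by the sign of its least-squares slope):

```python
def slope(prices):
    """Least-squares slope of a price window against time 0..n-1."""
    n = len(prices)
    mean_x = (n - 1) / 2
    mean_y = sum(prices) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(prices))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def classify_trend(prices):
    """Label a window 'upward' or 'downward' by the sign of its slope."""
    return "upward" if slope(prices) >= 0 else "downward"

print(classify_trend([1.10, 1.11, 1.13, 1.12, 1.14]))  # → upward
```

Running this over a sliding window at sub-day granularity is what distinguishes the proposed system from day/week/month-level visualization.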

28 Progress and Expected Contributions
- Forex data acquired, sanitized, formatted, and ready to use
- Built a system to batch-collect data from multiple feeds as it becomes available
- Current stage: building the design and field research
- Next steps: evaluate other distributed-computing libraries
- End contribution: an extensible framework for collecting, analyzing, and visualizing real-time data feeds

30 Hypothesis and Method
Hypothesis:
- Low-volume stocks typically do not generate mainstream news coverage.
- We hypothesize that social media could be a useful source of information about them.
Method:
- Back-test different methods of using Big Data (specifically Twitter) to try to predict future price movements.
- We will test various cases, attempting to find a correlation between tweets and the movement of low-volume stocks in price or volume.
- We will verify whether these tweets are leading or lagging indicators of price or volume changes.
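One common way to test the leading-vs-lagging question is cross-correlation at different lags; the sketch below is an assumed illustration (the slides do not specify the team's statistical method). A correlation peak at a positive lag suggests tweets lead price changes; a peak at a negative lag suggests they lag.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def lead_lag(tweets, returns, max_lag=3):
    """Correlate daily tweet counts against returns shifted by each lag."""
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            t, r = tweets[:-lag], returns[lag:]
        elif lag < 0:
            t, r = tweets[-lag:], returns[:lag]
        else:
            t, r = tweets, returns
        out[lag] = pearson(t, r)
    return out

tweets = [1.0, 5.0, 2.0, 8.0, 3.0, 9.0, 4.0, 7.0]   # daily tweet counts
returns = [0.0] + tweets[:-1]                        # moves echo tweets a day later
corr = lead_lag(tweets, returns, max_lag=2)
# corr[1] ≈ 1.0: in this toy series, tweet counts lead next-day moves
```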

34 Motivation
Traditional trading strategies usually involve time-series models such as GARCH, which make it difficult to incorporate categorical inputs such as Twitter data. Using classification models, we can predict whether an asset price will go up or down while incorporating unstructured data streams.

36 Progress and Expected Contributions
Progress:
- Researched various recent time-series classification models
- Set up an interface with the data API
Expected Contributions:
- Provide Hadoop-based implementations of classification models for time series

38 Motivation
The stock market has high-profit, high-risk features, so stock-market analysis and prediction have long attracted attention. The stock price trend is a complex nonlinear function, yet prices have a certain predictability.
MapReduce is a framework specially designed for processing large datasets from distributed sources; Apache's Hadoop is an implementation of MapReduce.

40 Progress and Expected Contributions
Expected Contributions: We are going to analyze the dataset Daily Holdings for All ProShares ETFs, which contains a wealth of information collected from the stock exchange. The first step is to scrutinize the data and identify stocks that may potentially go up. From these screened stocks, we will suggest to a given user a potential stock he or she may be interested in.
Progress: We have already obtained the dataset and analyzed some similarities that could be useful in further steps.

42 Motivation
- Objective: develop a real-time risk-management system (intraday Value at Risk) for large, complex portfolios in a unified framework.
- Expected Outcome: a system that performs the calculation of stressed VaR, "what-if" scenarios, and stress testing on complex portfolios with a large number of underlying risk factors and vectors in real time.
- Importance: risk management is crucial throughout investment and trading activities, from the front trading desk to the back office. However, because of the complexity of calculating VaR on a large multi-asset portfolio, delivering VaR in real time is not feasible on legacy systems. Big Data with in-memory multi-dimensional analytics can resolve this.

46 Motivation
Traditional technical trading takes into account only the quantitative factors that influence stock prices, not the qualitative ones. It is well known that news items have a significant impact on stock indices and prices. To make better predictions, our model combines quantitative methods with NLP feature analysis of news headlines.

48 Progress and Expected Contributions
What we have done:
- Collected numerical data and part of the news.
- Built preliminary models, verifying that our approach is reasonable.
- [Figure: correlation of NLP features vs. AAPL stock prices at offsets Day -3 through Day 2; table: R² of regression methods (train/test) for a baseline linear SVM and an SVM with RBF kernel on indexes with lag = 1.]
What we will do:
- Implement the algorithm on the whole dataset.
- Demonstrate the positive impact of NLP features on stock-price prediction.

50 Motivation
- To help stock buyers make wiser choices
- To find those who are very good at profiting from stocks (experts)
- To use user-based collaborative recommendation to measure the similarity between a buyer and each expert
- To recommend to the buyer stocks held by the most similar expert
- To relieve stock buyers of the heavy burden of looking through thousands of stocks to make a wise choice
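The user-based collaborative step above can be sketched as follows (the ticker symbols, holding vectors, and expert names are all hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two holding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(buyer, experts, stocks):
    """Pick the most similar expert; return stocks they hold that the buyer lacks."""
    best = max(experts, key=lambda name: cosine(buyer, experts[name]))
    return best, [s for s, held_e, held_b in zip(stocks, experts[best], buyer)
                  if held_e and not held_b]

stocks = ["AAPL", "MSFT", "IBM", "TSLA"]
buyer = [1, 0, 1, 0]                       # holds AAPL and IBM
experts = {
    "alice": [1, 1, 1, 0],                 # AAPL, MSFT, IBM
    "bob":   [0, 0, 0, 1],                 # TSLA only
}
print(recommend(buyer, experts, stocks))   # → ('alice', ['MSFT'])
```

In a fuller version, the binary holding vectors would be replaced by realized returns or position weights, so similarity reflects profitability rather than mere overlap.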

54 Motivation
- Stock movements due to White House-related news on Twitter
- Twitter-based hedge funds; algorithmic trading
- Twitter data is traditionally used for sentiment analysis, but it is now also a great source for consuming news in real time
- Stock prices are correlated with news
By applying appropriate filters to tweets from news agencies and then scoring the filtered tweets, we aim to generate signals for stock prices that algorithms or traders could consume to make better trades. The framework we are building is scalable and could potentially be used to generate signals for a portfolio of heterogeneous stocks.
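The filter-then-score pipeline can be sketched as below; the keyword weights, account whitelist, and tweet texts are all hypothetical stand-ins, since the slides do not describe the team's actual filtering or scoring rules:

```python
# Hypothetical keyword weights for the scoring step
POSITIVE = {"beats": 2, "surge": 2, "approval": 1, "growth": 1}
NEGATIVE = {"tariff": -2, "probe": -2, "lawsuit": -1, "recall": -1}

NEWS_AGENCIES = {"reuters", "ap", "bloombergnews"}  # assumed whitelist

def is_news(tweet):
    """Filter: keep only tweets from whitelisted news accounts."""
    return tweet["user"].lower() in NEWS_AGENCIES

def score(tweet):
    """Score: sum keyword weights over the tweet text."""
    words = tweet["text"].lower().split()
    return sum(POSITIVE.get(w, 0) + NEGATIVE.get(w, 0) for w in words)

stream = [
    {"user": "Reuters", "text": "Apple beats earnings estimates, shares surge"},
    {"user": "randomguy42", "text": "buy everything now!!!"},
    {"user": "AP", "text": "regulators open probe into new tariff rules"},
]
signals = [score(t) for t in stream if is_news(t)]
print(signals)  # → [4, -4]
```

The resulting signal series is what would be plotted alongside real-time stock prices and tested for correlation with price movements.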

56 Progress and Expected Contributions
So far we have completed the stream-ingestion part and are working on refining our filtering and scoring algorithms to improve the correlation between generated signals and stock-price movements. Below is the breakdown of work and contributions by team member:
1. Ingesting the data stream - Mayank
2. Filtering algorithm - Mandeep
3. Scoring algorithm - Shreyas
4. Fetching real-time stock data and plotting it together with the stock signals generated in real time - Rajesh

58 Motivation
One of the biggest challenges for banks is minimizing the customer attrition rate, which depends directly on customer satisfaction. Customers are inclined to choose banks that can be trusted with their services. Banks currently make decisions based on only a subset of their data because scalable solutions are absent. In this project, we propose a scalable design to counter these problems.

60 Progress and Expected Contributions
- Identify major retail-banking issues by state and match and analyze them by geographic or socio-economic bracket.
- Surface the top concerns of consumers in various states.
- Derive the business impact on an institution of customer satisfaction or dissatisfaction with how complaints are handled.
- Propose likely solutions that can serve as a first response to future complaints of a similar nature.
- Hypothesize a performance metric, applicable to all complaints, that can be used to prioritize complaints by resolution time.

62 Motivation
Understanding volatility in financial markets has long been of interest to hedgers and speculators. Empirical evidence has shown that volatility is a highly nonlinear, evolving process. Modeling this process on the Hadoop ecosystem can offer tremendous advantages over traditional econometric models, which are limited to datasets that fit in main memory.

63 Dataset, Algorithm, Tools
Dataset: We have procured a massive dataset of price quotes on equities, exchange-traded futures, and market indices spanning the last ten to fifteen years at one-minute granularity. In addition to price quotes on specific instruments, our dataset features derivative indicators of price and volume activity.
Algorithm: We propose to train and test several scalable machine-learning regression models on our dataset, with the goal of producing a functional form of future realized volatility at the symbol level that minimizes bias and variance and ultimately generalizes well to unseen data. Feature selection will be integral to the task, given the likelihood that many of our input variables are highly correlated. We intend to build a framework on top of Apache Spark that can, at a minimum, perform n-fold cross-validation of a training model and use beam search or other established methods to calibrate the hyperparameters of our SVM, random-forest, or regularized regression model in a reasonably fast time frame, given the algorithmic complexity of the underlying routines.
Tools: Hadoop, Apache Spark, Mahout, AWS, Git, R, Java, Python
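The cross-validation-plus-hyperparameter-search loop can be sketched in plain Python (the one-parameter model and the regularization grid are hypothetical stand-ins; the actual framework would run this on Spark):

```python
import statistics

def k_fold_mse(xs, ys, fit, predict, k=5):
    """Mean squared error averaged over k held-out folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))            # interleaved folds
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in test_idx]
        model = fit([x for x, _ in train], [y for _, y in train])
        errs = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(statistics.mean(errs))
    return statistics.mean(fold_errors)

# Hypothetical one-parameter model: ridge-style scaling y ≈ w * x
def fit(xs, ys, lam=0.1):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def predict(w, x):
    return w * x

xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]
# Grid search over the regularization hyperparameter
scores = {lam: k_fold_mse(xs, ys, lambda a, b: fit(a, b, lam), predict)
          for lam in (0.0, 0.1, 1.0)}
best_lam = min(scores, key=scores.get)
```

Beam search generalizes this by keeping only the few most promising hyperparameter settings at each refinement step instead of scoring the full grid.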

66 Motivation
Main idea: job ads.
- Employers would determine a more reasonable salary for a position.
- Employees could find more jobs matching their background by using our recommendation system.
We want to help employers and jobseekers figure out the market worth of different positions by building a prediction engine for the salary of any UK job ad. In this way we would bring more transparency to this important market.
[Figure: simple sample of our Salary Engine]
