By Anmol Rajpurohit, Feb 27, 2014.
One of the biggest challenges at big conferences such as Strata 2014 is that there is so much happening quickly and simultaneously that it is almost impossible to catch all the action.
We help you by summarizing the key insights from some of the best and most popular sessions at the conference. These concise, takeaway-oriented summaries are designed both for people who attended the conference but would like to revisit the key sessions for a deeper understanding, and for those who could not attend.
See also: Strata 2014 Santa Clara: Highlights from Day 2 (Feb 12)
Data Journalism – Organized Crime and Corruption Reporting by Drew Sullivan, Organized Crime and Corruption Reporting Project
Drew started by emphasizing that most real-life problems come from interdisciplinary areas rather than a single field. Organized crime and corruption is a global business worth about 2-3 trillion US dollars. Countries with very high levels of such activity include Russia, Montenegro, Kosovo, Equatorial Guinea, North Korea, and Uzbekistan. When organized crime works with corrupt government officials in anti-democratic and weak states, it grows stronger and threatens local and regional security.
Meanwhile, the proceeds from crime are eagerly sought by western banks, hedge funds, and markets. As an example, he displayed details of a Tormex bank account: over half a billion US dollars were poured into this Latvian bank account of a phantom company in less than two years.
He urged the data science community to come forward and help stop this large-scale illegal activity through a "hack and track" approach. He concluded by saying that efficient investigative reporting is the result of cooperation between investigative journalists, programmers, and others who want to use data to help create a cleaner, fairer, and more just global society.
Enabling Business Transformation with Analytics over Real-time Streaming Data by Anand Venugopal and Pranay Tonpay, Impetus Technologies Inc.
Anand opened the session by referring to the retina of the human eye, which communicates with the brain in real time at 10 million bits per second. Real-time streaming analytics is not just about low-latency queries over batch data.
He described business transformation as a whole new domain of possibilities and unexpected breakthroughs in operational efficiency. He classified business transformation use cases into three categories:
- Existential use cases
  - Fraud analytics in a credit card company
  - IT or other security systems
- Enhancement use cases
  - Financial trading: risky/fraudulent trades
  - Digital advertising: optimization
  - Predictive vs. reactive maintenance
- Transformational use cases
  - Retail: detecting customer location and serving the customer
  - Insurance: drone flight images for claim adjudication
  - Agriculture: satellite image analysis for optimal plots
  - Healthcare: complex models for disease detection in real time
Next, he compared the two prevalent approaches to stream analytics: using proprietary algorithms and doing it yourself. The major challenges with the former are vendor lock-in (in other words, lack of flexibility) and no opportunity to leverage the open source movement. The "do it yourself" approach, on the other hand, has challenges such as the integration and management of open source tools, as well as a significant delay in time to market.
After providing an overview of the current business landscape, Anand and Pranay presented real-time streaming analytics as an offering from Impetus, providing features such as high-speed data ingestion, elastic scaling, varied data parsing, pluggable persistence, real-time index and search, dynamic message routing, and more. They summarized their approach as the iterative cycle of:
Sense -> Analyze -> Act -> Sense
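The Sense -> Analyze -> Act cycle can be sketched as a minimal event-driven loop. The event values, sliding-window size, and anomaly rule below are all hypothetical illustrations, not part of the Impetus offering:

```python
from collections import deque

def sense(events):
    """Ingest raw events from a stream (here, just an iterable)."""
    for event in events:
        yield event

def analyze(event, window, size=3):
    """Maintain a sliding window and flag anomalous spikes
    (hypothetical rule: value more than twice the window average)."""
    window.append(event)
    if len(window) > size:
        window.popleft()
    avg = sum(window) / len(window)
    return event > 2 * avg

def act(event):
    """React in real time, e.g. raise an alert."""
    return f"ALERT: anomalous value {event}"

window = deque()
alerts = [act(e) for e in sense([10, 12, 11, 50, 9]) if analyze(e, window)]
# The spike (50) triggers an alert; each alert feeds back into sensing.
```

In a production system each stage would be a distributed component (ingestion, stream processor, action handler), but the feedback shape of the loop is the same.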
Graph Analysis with One Trillion Edges on Apache Giraph by Avery Ching, Facebook
Avery motivated graph analysis by showing examples of image, recommendation, and network graphs. Graph analytics has applications beyond large web-scale organizations: many computing problems can be efficiently expressed and processed as a graph, which can lead to useful insights that drive product and business decisions.
Next, he described the lifecycle of Giraph, which was inspired by Google's Pregel but runs on Hadoop. He showed how many problems can be modeled as graphs when we think of sub-problems as vertices. Providing an overview of the Apache Giraph data flow, he mentioned a few features of Giraph that are not in Pregel, such as shared aggregators, master computation, and composable computation.
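The vertex-centric model behind Pregel and Giraph can be illustrated with a toy single-source shortest-paths computation: vertices exchange messages in supersteps and halt when no messages remain. This is a minimal sketch of the idea, not Giraph's actual Java API, and the graph below is made up:

```python
import math

# Toy directed graph: vertex -> [(neighbor, edge_weight)]
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}

def pregel_sssp(graph, source):
    """Each superstep: an active vertex takes the best incoming
    distance, and if it improved, messages (value + weight) to
    its out-neighbors. Computation halts when no messages remain."""
    value = {v: math.inf for v in graph}
    messages = {source: [0]}
    while messages:
        next_messages = {}
        for v, incoming in messages.items():
            candidate = min(incoming)
            if candidate < value[v]:      # vertex "wakes up" only on improvement
                value[v] = candidate
                for nbr, w in graph[v]:
                    next_messages.setdefault(nbr, []).append(candidate + w)
        messages = next_messages
    return value

distances = pregel_sssp(graph, "A")   # {'A': 0, 'B': 1, 'C': 2}
```

Giraph distributes the vertices and message passing across Hadoop workers, but the per-vertex program has this same shape.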
He briefly explained techniques such as balanced propagation, superstep splitting, and avoiding out-of-core computation. He concluded by stating that future research and development efforts should focus on evaluating alternative computing models, performance, lowering the barrier to entry, and applications.
Socializing Search. Professionally. by Sriram Sankar and Daniel Tunkelang, LinkedIn
Daniel started by providing an overview of how LinkedIn addresses search quality by leveraging the economic graph. Social context means that the relevance of search results is highly personalized. He explained how machine learning ranks results socially, using a model of trees with logistic regression leaves. Focusing on its customer base, LinkedIn is moving towards entity-oriented search, i.e., when a term is searched, the displayed results should span all relevant entities such as personal profiles, company profiles, company employees, job openings, etc. He mentioned that query understanding acts as a relevance filter, with phases such as segmentation, decoding, and query rewriting, resulting in a new query. He also announced that LinkedIn would soon have an entity-driven search assistance feature.
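The "trees with logistic regression leaves" idea can be sketched as follows: a decision tree partitions the feature space (here on social context), and each leaf applies its own logistic model to the remaining features. The features, split, and weights below are hypothetical illustrations, not LinkedIn's actual model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(features):
    """One tree split routes a (query, result) pair to a leaf;
    each leaf holds its own logistic regression (toy weights)."""
    if features["connection_degree"] <= 2:        # split on social context
        w, b = {"text_match": 3.0, "profile_views": 0.5}, -1.0  # leaf 1 model
    else:
        w, b = {"text_match": 1.5, "profile_views": 0.1}, -2.0  # leaf 2 model
    z = b + sum(w[k] * features[k] for k in w)
    return sigmoid(z)

close = score({"connection_degree": 1, "text_match": 1.0, "profile_views": 2.0})
distant = score({"connection_degree": 3, "text_match": 1.0, "profile_views": 2.0})
# With identical text match, the first-degree connection scores higher,
# which is how social context personalizes relevance.
```

The appeal of this hybrid is that the tree captures coarse non-linear regimes (e.g. close vs. distant connections) while each leaf's logistic model stays cheap to evaluate and easy to calibrate.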
Sriram talked about the unique infrastructure challenges posed by the efforts to improve LinkedIn search. He explained how they leverage Lucene while developing additional components to support LinkedIn's specific needs. He asserted that it is easy to build a basic search engine but difficult to make it sophisticated. Next, he explained the LinkedIn search stack, which leverages the Lucene-served search index to achieve static-rank-based document ordering. Regarding performance, he mentioned offline data builds on Hadoop along with partial index updates. Scoring is done after retrieval, with the option to compute costly features offline. Sriram concluded by suggesting that though this approach was developed specifically for LinkedIn's needs, the concepts and learnings can be applied to a wide range of other problems.
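The retrieve-then-score pattern Sriram described — a cheap retrieval pass ordered by a precomputed static rank, followed by costlier scoring on the small candidate set — can be sketched as follows. The documents, static-rank values, and relevance features are hypothetical, not LinkedIn's:

```python
# Toy corpus: doc id -> text plus an offline-computed static rank.
docs = {
    1: {"text": "data engineer", "static_rank": 0.9},
    2: {"text": "data scientist", "static_rank": 0.7},
    3: {"text": "sales manager", "static_rank": 0.8},
}

def retrieve(query, k=2):
    """Phase 1: cheap substring match, candidates ordered by the
    static rank computed offline (e.g. in a Hadoop data build)."""
    hits = [d for d, doc in docs.items() if query in doc["text"]]
    hits.sort(key=lambda d: docs[d]["static_rank"], reverse=True)
    return hits[:k]

def rescore(query, doc_ids):
    """Phase 2: costlier relevance features, but only on the
    handful of candidates that survived retrieval."""
    def relevance(d):
        overlap = len(set(query.split()) & set(docs[d]["text"].split()))
        return overlap + docs[d]["static_rank"]
    return sorted(doc_ids, key=relevance, reverse=True)

results = rescore("data engineer", retrieve("data"))
```

Ordering the index by static rank lets retrieval stop early with the most promising documents, which is what makes deferring the expensive features to the second phase affordable.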