
AnalyticsStreet Panel Report: Frontiers and Dangers of Analytics and Big Data


A summary of a technology panel I moderated with analytics leaders from DataXu, Tamr, Sqrrl, Nutonian, and Basis Technology. Key issues that emerged include machine learning, streaming data, unstructured and text data, the data mining process, security, and privacy.



By Gregory Piatetsky, @kdnuggets, Nov 18, 2014.

Earlier this month I moderated a panel on the "Frontiers of Analytics" at Analytics Street, Boston Data Analytics Conference, which was organized by Vishal Kumar, @AnalyticsWeek.

We had a great team with 5 panelists from Boston-area companies that are prominent in analytics and Big Data:
  • Shashank Agarwal, data scientist at DataXu
  • Nidhi Aggarwal, Strategy and Marketing lead at Tamr
  • Adam Fuchs, CTO/co-founder, Sqrrl
  • Andrew Lamb, Chief Architect, Nutonian
  • David Murgatroyd, VP Engineering, Basis Technology

 
Unfortunately, we did not have enough time to adequately get into the issues.

To share their wisdom with KDnuggets readers, I have asked the panelists to send me the answers to 2 questions we discussed at the panel, and a third question added later:
  1. What is the most important frontier of analytics/ Big Data that your company is addressing now?
  2. What are other 2-3 most important frontiers of analytics/ Big Data that your company is NOT addressing in the next 2-3 years? What other companies are leaders in those areas?
  3. What are the most significant risks to analytics / Big Data projects that companies /organizations need to address?


Some of the important themes that emerge from their answers are the importance of machine learning, dealing with streaming data, unstructured and text data, data preparation, security, and privacy.

Here are the full answers, in alphabetical order of panelists.

Shashank Agarwal is a data scientist at DataXu, an ad-tech company in the programmatic marketing space. He works on projects related to machine learning and data mining. His background is in Biology and text mining. He has published a number of journal articles and medical notes from hospitals.

Q1. What is the most important frontier of analytics/ Big Data that your company is addressing now?

Shashank Agarwal: DataXu uses machine-learning algorithms to make decisions about the correct ads to show to an audience. This happens in less than 100 milliseconds, because these decisions need to be made while the page is being loaded on the user's end.

Traditional "direct response" digital campaigns tend to focus on a single goal, be it to drive clicks, a purchase, or a conversion action like "request a quote." Increasingly, I've noticed more advertisers, especially ones focused on improving awareness or favorability for their brands, are looking to optimize towards multiple objectives together, and that's what I find to be the most interesting frontier. For example, was the ad in view on the user's end (think above the fold/below the fold), or did the ad serve to a bot, or did the ad show up on an appropriate publisher. Often, these need to be optimized simultaneously. Since we use machine learning at DataXu, this involves having multiple algorithms optimizing towards multiple objectives on the same opportunity to show an ad, and making a decision based on the priority of the objectives.
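To make the idea of weighing several objectives on one ad opportunity concrete, here is a minimal sketch. It is not DataXu's method: the objective names, the priority weights, the probability fields, and the 0.5 threshold are all hypothetical, and a priority-weighted average is only one of several ways to combine objectives by priority.

```python
from typing import Callable, Dict, List, Tuple

Opportunity = Dict[str, float]
# (objective name, priority weight, scoring function); all values are hypothetical.
Objective = Tuple[str, float, Callable[[Opportunity], float]]

OBJECTIVES: List[Objective] = [
    ("viewable", 3.0, lambda o: o["p_in_view"]),           # chance the ad is actually in view
    ("human", 2.0, lambda o: 1.0 - o["p_bot"]),            # chance the viewer is not a bot
    ("brand_safe", 1.0, lambda o: o["p_safe_publisher"]),  # chance the publisher is appropriate
]


def combined_score(opportunity: Opportunity, objectives: List[Objective]) -> float:
    """Priority-weighted average of the per-objective scores, in [0, 1]."""
    total_weight = sum(weight for _, weight, _ in objectives)
    weighted = sum(weight * score(opportunity) for _, weight, score in objectives)
    return weighted / total_weight


def should_show_ad(opportunity: Opportunity, threshold: float = 0.5) -> bool:
    """Show the ad only if the combined score clears the (assumed) threshold."""
    return combined_score(opportunity, OBJECTIVES) >= threshold


good = {"p_in_view": 0.9, "p_bot": 0.1, "p_safe_publisher": 0.8}
poor = {"p_in_view": 0.2, "p_bot": 0.9, "p_safe_publisher": 0.5}
print(should_show_ad(good), should_show_ad(poor))
```

In a production setting each scoring function would be a trained model, and the whole decision would have to fit inside the sub-100-millisecond budget mentioned above.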

Q2. Excluding what your company is doing, what are other 2-3 most important frontiers of analytics/ Big Data that other companies are working on? What companies are leaders in those areas?

Shashank Agarwal: There are a couple of frontiers that I find to be extremely important.

1. Streaming ETL. ETL (Extract, Transform, Load) is a process through which logs that are collected over some time period (hour/day/week) are read (Extract), parsed (Transformed), and loaded onto a database (Load), so that they are available to downstream analytics processes or decision engines. Most companies, including DataXu, run this process either hourly or daily. However, in a real world setting, where thousands of transactions are taking place every second, an hourly or daily delay in having the latest information might be unacceptable. A case from DataXu's perspective: if we try to serve an ad impression, and discover that the ad was served to a fraudulent bot, then 20 seconds later, when that bot tries to generate another fraudulent impression, we want to be able to block it based on the data gathered from the first impression.

2. Schema-less (or flexible schema) data warehouse: In marketing, as in other fields, we see different sources of data, and they can all have different attributes. For example, an ad opportunity on a mobile device will include information on the device vendor, or carrier, which won't be seen on an ad opportunity seen from a web browser on a laptop. Or, we might start getting a new attribute that we had not seen before. In the current scenario of a schema-based data warehouse, we expect all the attributes to be known and defined before we start receiving data. This makes it difficult to adapt to the changing data. Also, adding new "columns" to handle a specific attribute (think carrier for mobile) means that for data where this information is not present, you end up filling nulls.

Schema-less architectures, such as MongoDB and CouchDB, are based on JSON and avoid these issues. However, one must be careful to ensure that attributes that can be normalized are being normalized. For example, geo location in Europe might include an attribute "postal_code", whereas in the US, it includes "zip_code", which are essentially the same thing, and should be normalized.
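The two points above can be sketched together in a few lines. This is an illustration under stated assumptions, not a description of any company's pipeline: the alias table, the field names "source_id" and "is_fraud", and the class name are all hypothetical. Each event is normalized as it arrives, whatever attributes it carries, and the blocklist is updated per event rather than per hourly or daily batch.

```python
from typing import Any, Dict, Set

# Hypothetical alias table: attribute names that mean the same thing.
ATTRIBUTE_ALIASES = {"zip_code": "postal_code"}


def normalize(event: Dict[str, Any]) -> Dict[str, Any]:
    """Rename aliased attributes; pass all other attributes through unchanged."""
    return {ATTRIBUTE_ALIASES.get(name, name): value for name, value in event.items()}


class StreamingBotFilter:
    """Keeps a blocklist that is updated per event instead of per batch."""

    def __init__(self) -> None:
        self.blocked_sources: Set[str] = set()

    def process(self, raw_event: Dict[str, Any]) -> str:
        event = normalize(raw_event)
        source = event["source_id"]  # hypothetical field identifying the requester
        if source in self.blocked_sources:
            return "blocked"
        if event.get("is_fraud", False):  # hypothetical flag from a fraud detector
            # The source is blocked for the very next event, not an hour later.
            self.blocked_sources.add(source)
            return "served_then_flagged"
        return "served"


bot_filter = StreamingBotFilter()
events = [
    {"source_id": "a1", "zip_code": "02139", "is_fraud": True},
    {"source_id": "a1", "postal_code": "75001"},
    {"source_id": "b2", "carrier": "ExampleTel"},  # extra attribute, no schema change needed
]
results = [bot_filter.process(event) for event in events]
print(results)
```

The second event from source "a1" is blocked because of what was learned from the first one, which is the behavior the fraudulent-bot example calls for.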

Q3. What are the most significant risks to analytics / Big Data projects that companies /organizations need to address?

Shashank Agarwal: Many companies feel that they need to jump on the big data bandwagon, and hire a team of data scientists, when in reality, the amount of data they have can fit on a single hard drive or even in memory. This results in them not deriving the value that they originally anticipated, and could result in them discounting the true value of big data and data science. Moreover, even if the data is truly large, and stored in a data warehouse, not extracting actionable insights from the data is also a risk.
Companies need to be realistic about the amount of data they have, and whether that data can bring them value before investing in big data.



Nidhi Aggarwal, @aggarwalnidhi, is a Strategy and Marketing lead at Tamr, a startup offering Big Data curation, and a founding board member of Cloud vLab. Short bio: Nidhi Aggarwal leads strategy and marketing at Tamr. Prior to joining Tamr, Nidhi founded Cloud vLab, makers of qwikLAB, a software-learning platform used to create and deploy on-demand lab environments. In the years before Cloud vLab, Nidhi worked at McKinsey & Company, advising Fortune 150 companies on Big Data Strategy. Nidhi holds a Ph.D. in Computer Science from the University of Wisconsin-Madison.

Q1. What is the most important frontier of analytics/ Big Data that your company is addressing now?

Nidhi Aggarwal: The most important frontier of analytics/Big Data that Tamr is addressing is simplifying the process of preparing data for consumption by analytics and visualization tools. Most of the time is spent stitching the data together from various silos, cleaning it, and transforming it. This process is called curation. Tamr dramatically reduces the time and effort required to do curation and speeds up the time to analytics. Putting data in data lakes gives the illusion of unification, but those data lakes can turn into data swamps really quickly.
 
Q2. Excluding what your company is doing, what are other 2-3 most important frontiers of analytics/ Big Data that other companies are working on? What companies are leaders in those areas?

Nidhi Aggarwal: Tamr is plumbing (GP: provides the essential plumbing for the data mining process), so I am really excited about companies that are upstream and downstream of us. For example, Basis Tech for unstructured data and Recorded Future for predictions based on data.

Q3. What are the most significant risks to analytics / Big Data projects that companies /organizations need to address?

Nidhi Aggarwal: The most significant risk is a lack of understanding of the process of going from raw data to analytics and not putting together a scalable process to get the data prepared.


Adam Fuchs, CTO and co-founder of Sqrrl, is responsible for ensuring that Sqrrl is leading the world in Big Data Infrastructure technology. Previously at NSA, Adam was an innovator and technical director for several database projects, handling some of the world's largest and most diverse data sets. He is a co-founder of the Apache Accumulo project.

Q1. What is the most important frontier of analytics/ Big Data that your company is addressing now?

Adam Fuchs: At Sqrrl, we feel that Linked Data Analysis stands to change the game for making sense out of Big Data by putting it in context. What that involves is transforming raw data into linked information, and being able to explore and traverse the intricate relationships between different entities of interest. Not only that, but the attribution of derived knowledge back to its original source is a key feature in terms of data quality and provenance. There are obviously back-end technologies that are instrumental in making this happen, but just as important are effective ways to visualize and interact with the data and analytic results. These are all key challenges we're tackling today.

Q2. Excluding what your company is doing, what are other 2-3 most important frontiers of analytics/ Big Data that other companies are working on? What companies are leaders in those areas?

Adam Fuchs: Unstructured entity disambiguation is a huge area of analytics that includes, among other things, finding and normalizing concepts found in corpuses of human language text. Figuring out that the "Bill" in one document refers to the same being as the "William" in another relies on decades of research in automating what humans are naturally pretty good at. The best approaches to this problem today require scalability to corpuses in the Big Data range. Basis is doing some great work in this area, as well as another Boston company, Diffeo.

Another area of analytics that is dear to my heart is software analysis. This area includes proving that certain properties hold of programs and detecting behaviors that may not match expectations for what a program should do. From a cybersecurity perspective, software analytics can be used for both attack and defense operations, detecting and exploiting vulnerabilities in the Big Data space of code that makes up a typical enterprise network. This is an area where I would like to see more commercial research and development. Many of the top cybersecurity firms are working in this problem space, but I think computer science departments at many universities are still leading the charge.

Q3: What are the most significant risks to analytics / Big Data projects that companies /organizations need to address?

Adam Fuchs: Obviously, given our heritage in the US Intelligence Community and current focus on cybersecurity, we feel that security is the biggest risk to Big Data and Analytics projects today. When you consolidate all your data in one place, as is the case with many Big Data efforts, you are compounding the risk of protecting that data. Coupled with the increased sophistication of motivated cyber attackers, many new Big Data efforts seem ripe for the picking as a potential target. As an industry, we're lacking both ways of securely keeping/accessing data as well as good tools for quickly investigating the impact of security incidents at scale. At Sqrrl, we're addressing both of these concerns.


Andrew Lamb, Chief Architect at Nutonian, is responsible for driving the strategy and development of Nutonian's suite of cognitive computing products. Previously, as an Engineering Lead at Vertica (acquired by HP), Andrew designed, implemented and supported the distributed query planning and execution engine, SQL language features and performance optimizations for the flagship Vertica Analytic Database.

Q1. What is the most important frontier of analytics/ Big Data that your company is addressing now?

Andrew Lamb: Lowering the skills needed to go from data to understanding and actionable insight. Specifically, Nutonian is focused on automating the process from data to insight with *software* rather than throwing more people at the problem. Other modeling techniques may give you predictive power, but you have to know which to try (or how to evaluate multiple options), and you still have to interpret the results into plain English so that you can understand what is going on with your business.

Q2. Excluding what your company is doing, what are other 2-3 most important frontiers of analytics/ Big Data that other companies are working on? What companies are leaders in those areas?

Andrew Lamb:
  • Raw data storage / retrieval (Structured): HP/Vertica (I am biased), Redshift (cloud)
  • Raw data storage / retrieval ("Unstructured"): HDFS (Hortonworks/Cloudera)
  • Data preparation / curation: I love Tamr, but it is a hard thing to do

 
Q3: What are the most significant risks to analytics / Big Data projects that companies /organizations need to address?

Andrew Lamb: Security - as we continue putting all this data into centralized systems that make it easy to understand and access, those systems will become very tempting targets for hackers and others.


David Murgatroyd is VP of Engineering at Basis Technology, which he joined in 2005. He leads the engineering team responsible for text analytics. He has been building natural language processing systems since 1998.

Q1. What is the most important frontier of analytics/ Big Data that your company is addressing now?

David Murgatroyd: Basis's expertise is in analytics of unstructured data, especially text. The frontier we're addressing now is better integrating unstructured and structured data by focusing on entity-centric systems, or as some say 'things, not strings'. That is, by resolving chunks of text to references of real-world people, places, organizations, products etc., analytics can be done over those items rather than by counting words and the like. Transforming the data to this level allows more effective user-computer collaboration since users think in terms of those entities not specific variations of their names. An example of this work is our recently launched Entity Resolver (www.basistech.com/text-analytics/rosette/entity-resolver/).
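The difference between counting strings and counting things can be shown with a toy sketch. It is not how Basis's Entity Resolver works: the alias table and entity identifiers below are hypothetical, and a real resolver uses context rather than a fixed lookup.

```python
from collections import Counter
from typing import Iterable

# Hypothetical alias table mapping surface strings to entity identifiers.
ALIASES = {
    "ibm": "ORG:IBM",
    "international business machines": "ORG:IBM",
    "big blue": "ORG:IBM",
    "nyc": "PLACE:New_York_City",
    "new york city": "PLACE:New_York_City",
}


def resolve(mention: str) -> str:
    """Map a text mention to an entity id, or mark it as unresolved."""
    key = mention.strip().lower()
    return ALIASES.get(key, "UNRESOLVED:" + key)


def count_entities(mentions: Iterable[str]) -> Counter:
    """Count mentions per resolved entity rather than per surface string."""
    return Counter(resolve(mention) for mention in mentions)


mentions = ["IBM", "Big Blue", "NYC", "New York City", "International Business Machines"]
string_counts = Counter(mentions)          # five different strings, one occurrence each
entity_counts = count_entities(mentions)   # two entities
print(entity_counts)
```

Counting by string sees five unrelated items, while counting by entity sees one organization mentioned three times and one place mentioned twice, which is closer to how a user thinks about the data.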

Q2. Excluding what your company is doing, what are other 2-3 most important frontiers of analytics/ Big Data that other companies are working on? What companies are leaders in those areas?

David Murgatroyd: One important frontier is the ever more nuanced trade-off between privacy and features. An example is the advent of Ello as an alternative to Facebook. Airbnb vs a hotel is a similar dichotomy with a differing motivation for decreased privacy: more trust in the case of Airbnb vs. more interaction in the case of Facebook.

Another frontier is analytics for social good. Firms like hrdag.org and globalforestwatch.org are good examples. Hopefully the continuing advance and simplification of open source tools developed for more commercial applications will enable these sorts of uses.

Q3: What are the most significant risks to analytics / Big Data projects that companies /organizations need to address?

David Murgatroyd: Commercial analytics projects need to avoid appearing 'creepy'. Individuals anthropomorphize companies so when it's revealed that a company has been tracking an individual across many channels the individual naturally thinks 'this feels like stalking'. Giving users transparency, control and clear value from the use of their data seem important steps to address this.


What do you, readers, think?
