Poll Results: Largest Dataset Analyzed surprisingly stable


The results of the KDnuggets annual poll on Largest Dataset Analyzed show surprising stability over the last 3 years, with about 54% of answers in the GB range, and confirm the gap between the internet-scale data miners and the rest.



By Gregory Piatetsky, @kdnuggets, Jul 17, 2014.

The latest KDnuggets Poll asked: What was the largest dataset you analyzed / data mined?

The results, based on 392 votes, show a pattern that has remained surprisingly stable over the last 3 years:
  • over 50% of answers are in the Gigabyte range (the median answer was between 11 and 100 GB in each year 2012-14)
  • a small number (2-3%) of Big Data miners are working with internet-scale data sets (over 100 PB), at companies like Google and Facebook.
  • a small but significant gap, with almost no answers in the 1-100 PB range, separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100 PB+ Internet-scale data stores.

 
KDnuggets Poll: Largest Dataset Analyzed, 2012-2014

We can see the trends more clearly by grouping the answers into ranges for Megabytes (< 1 GB), Gigabytes (1-999 GB), Terabytes (1-999 TB), and Petabytes (> 1 PB). We will call data scientists whose largest dataset analyzed falls in each range Megabyte analysts, Gigabyte analysts, etc.
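The grouping described above can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the poll charts: the function names `size_range` and `range_shares` are hypothetical, decimal (SI) units are an assumption, and the poll itself collected answers as pre-defined size brackets rather than exact byte counts.

```python
from collections import Counter

# Assumption: decimal (SI) units; the poll does not say which convention it used.
GB, TB, PB = 10**9, 10**12, 10**15

LABELS = ("Megabyte", "Gigabyte", "Terabyte", "Petabyte")


def size_range(num_bytes):
    """Map a dataset size in bytes to one of the four ranges in the text."""
    if num_bytes < GB:
        return "Megabyte"
    if num_bytes < TB:
        return "Gigabyte"
    if num_bytes < PB:
        return "Terabyte"
    # Assumption: a size of exactly 1 PB is counted as Petabyte.
    return "Petabyte"


def range_shares(sizes_in_bytes):
    """Return the percent of answers in each range (input must be non-empty)."""
    counts = Counter(size_range(s) for s in sizes_in_bytes)
    total = len(sizes_in_bytes)
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Made-up sizes for illustration only; these are not poll data.
example_sizes = [500 * 10**6, 50 * GB, 200 * GB, 3 * TB]
print(range_shares(example_sizes))
```

With real responses in place of `example_sizes`, the same shares could be computed per year or per region to reproduce the kind of comparison shown in the charts.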

The global percent of Gigabyte analysts increased slightly, from 53% in 2012 to 54% in 2014. The percent of Megabyte analysts has steadily declined, as expected, while the percent of Terabyte analysts has grown slightly, from 16% to 18%.

KDnuggets Poll: Largest Dataset Analyzed, 2012-2014, ranges

Here is a similar chart just for US/Canada, which shows a decline in the percent of Gigabyte analysts and a corresponding growth in Terabyte and Petabyte analysts.

KDnuggets Poll: Largest Dataset Analyzed, 2012-2014, for US/Canada

Regional participation was:
  • 38%, US/Canada
  • 31%, Europe
  • 18%, Asia
  • 6.9%, Latin America
  • 3.3%, Africa/MidEast
  • 2.6%, AU/NZ

 
The chart below shows the distribution of Largest Dataset Ranges by Region, sorted by % of TB+ answers. We see that US/Canada and AU/NZ lead, with about 30% of their data miners having worked on TB-size databases. Next are Europe (19%), Latin America (15%), Asia (10%), and Africa/MidEast (7.7%).

KDnuggets 2014 Poll: Largest Dataset Analyzed, by region

Here are the results of past polls:
 
