The Popularity of Data Science Software
by Robert A. Muenchen

Abstract. This article, formerly known as The Popularity of Data Analysis Software, presents various ways of measuring the popularity or market share of software for advanced analytics. Such software is also referred to as tools for data science, statistical analysis, machine learning, artificial intelligence, predictive analytics, and business analytics, and is also a subset of business intelligence. Software covered includes: Actuate, Alpine, Alteryx, Angoss, Apache Flink, Apache Hive, Apache Mahout, Apache MXNet, Apache Pig, Apache Spark, BMDP, C, C++ or C#, Caffe, Cognos, DataRobot, Domino Data Labs, Enterprise Miner, FICO, FORTRAN, H2O, Hadoop, Info Centricy or Xeno, Java, JMP, Julia, KNIME, Lavastorm, MATLAB, Megaputer or PolyAnalyst, Microsoft, Minitab, NCSS, Oracle Data Miner, Prognoz, Python, R, RapidMiner, Salford SPM, SAP, SAS, Scala, Spotfire, SPSS, SPSS Modeler, SQL, Stata, Statgraphics, Statistica, Systat, Tableau, Tensorflow, Teradata, Vowpal Wabbit, WEKA/Pentaho, and XGboost.

Updates: The most recent update was the Scholarly Articles section 6/1. I announce the updates to this article on Twitter: http: //twitter. Bob. Muenchen.

Introduction. When choosing a tool for data analysis, now more commonly referred to as analytics or data science, there are many factors to consider: Does it run natively on your computer? Does the software provide all the methods you need? If not, how extensible is it? Does its extensibility use its own unique language, or an external one (e.g., Python, R) that is commonly accessible from many packages? Does it fully support the style (programming, or menus and dialog boxes, or workflow diagrams) that you like? Are its visualization options (e. Does it provide output in the form you prefer (e. LaTeX integration)? Does it handle large enough data sets?
Do your colleagues use it so you can easily share data and programs? Can you afford it?

There are many ways to measure popularity or market share and each has its advantages and disadvantages. In rough order of the quality of the data, these include: Job Advertisements, Scholarly Articles, IT Research Firm Reports, Surveys of Use, Books, Blogs, Discussion Forum Activity, Programming Popularity Measures, Sales & Downloads, Competition Use, and Growth in Capability. Let’s examine each of them in turn.

Job Advertisements. One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements for each. Job advertisements are rich in information and are backed by money, so they are perhaps the best measure of how popular each software is now. Plots of job trends give us a good idea of what is likely to become more popular in the future. Indeed.com is the biggest job site in the U.S., making its collection the best around. As their co-founder and former CEO Paul Forster stated, Indeed. Monster, Careerbuilder, Hotjobs, Craigslist – as well as hundreds of newspapers, associations, and company websites." Indeed. Searching for jobs using Indeed. Some software is used only for data science (e.g., SPSS, Apache Spark) while others are used in data science jobs and more broadly in report-writing jobs (e.g., SAS, Tableau). General-purpose languages (e.g., C, Java) are heavily used in data science jobs, but the vast majority of jobs that use them have nothing to do with data science. To level the playing field, I developed a protocol to focus the search for each software within only jobs for data scientists.
The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the graphs in this section use those procedures to make the required queries. I collected the job counts discussed in this section on February 2. One might think that a sample on a single day might not be very stable, but the large number of job sources makes the counts in Indeed. The last time I collected this data was February 2. They grew between 7% and 1. Figure 1a shows that SQL is in the lead with nearly 1. Python and Java in the 1. Hadoop comes next with just over 1. R, the C variants, and SAS. (C, C++, and C# are combined in a single search since job advertisements usually seek any of them.) This is the first time this report has shown more jobs for R than SAS, but keep in mind these are jobs specific to data science. If you open up the search to include jobs for report writing, you’ll find twice as many SAS jobs. Next comes Apache Spark, which was too new to be included in the 2. It has come a long way in an incredibly short time. For a detailed analysis of Spark’s status, see Spark is the Future of Analytics, by Thomas Dinsmore. Tableau follows, with around 5,0. The 2014 report excluded Tableau due to its jobs being dominated by report writing. Including report writing will quadruple the number of jobs for Tableau expertise to just over 20,000. Figure 1a. The number of data science jobs for the more popular software (those with 2. Apache Hive is next, with around 3,9. Scala, SAP, MATLAB, and SPSS, each having just over 2,5. After those, we see a slow decline from Teradata on down. Much of the software had fewer than 2. When displayed on the same graph as the industry leaders, their job counts appear to be zero; therefore I have plotted them separately in Figure 1b. Alteryx comes out the leader of this group with 2.
Microsoft was a difficult search since it appears in data science ads that mention other Microsoft products such as Windows or SQL Server. To eliminate such over-counting, I treated Microsoft differently from the rest by including product names such as Azure Machine Learning and Microsoft Cognitive Toolkit. So there’s a good chance I went from over-emphasizing Microsoft to under-emphasizing it with only 1. Next comes the fascinating new high-performance language Julia. I added FORTRAN just for fun and was surprised to see it still hanging in there after all these years. Apache Flink is also in this grouping, which all have around 1. H2O follows, with just over 1. I find it fascinating that SAS Enterprise Miner, RapidMiner, and KNIME appear with a similar number of jobs (around 9. Those three share a similar workflow user interface that makes them particularly easy to use. The companies advertise the software as not needing much training, so it may be that employers feel little need to hire expertise if their existing staff can pick it up easily. SPSS Modeler also uses that type of interface, but its job count is about half that of the others, at 5. Bringing up the rear is Statistica, which was sold to Dell, then sold to Quest. Its 36 jobs trail far behind its similar competitor, SPSS, which has a staggering 7. The open source MXNet deep learning framework shows up next with 3. Tensorflow is a similar project with a 1. I expect both will be growing rapidly in the future. In the final batch that has few, if any, jobs, we see a few newcomers such as DataRobot and Domino Data Labs. Others have been around for years, leaving us to wonder how they manage to stay afloat given all the competition. It’s important to note that the values shown in Figures 1a and 1b are single points in time. The number of jobs for the more popular software does not change much from day to day.
Therefore the relative rankings of the software shown in Figure 1a are unlikely to change much over the coming year. The less popular packages shown in Figure 1b have such low job counts that their ranking is more likely to shift from month to month, though their position relative to the major packages should remain more stable. Each software has an overall trend that shows how the demand for jobs changes across the years. You can plot these trends using Indeed. Job Trends tool. However, as before, focusing just on analytics jobs requires carefully constructed queries, and when comparing two trends at a time, they both have to fit in the same query limit. Those details are described here. I’m particularly interested in trends involving R, so let’s see how it compares to SAS. In Figure 1c we see that the number of data science jobs for SAS has remained relatively flat from 2. February 28, 2. I made this plot. During that same period, jobs for R grew steadily and finally surpassed jobs for SAS in early 2. As noted in a blog post (and elsewhere in this report), use of R in scholarly publications surpassed that of SAS in 2.

Comparison of three generations of ActiGraph activity monitors under free-living conditions: do they provide comparable assessments of overall physical activity in 9-year old children? The main finding of this study is that the ActiGraph model AM7. 16. ActiGraph models GT1M and GT3X+ in a free-living setting, while the generations GT1M and GT3X+ provide close to similar outputs. The differences between the old and the newer monitors were more complex when investigating time spent at different intensities. Assuming that the GT1M and GT3X+ provide a more precise and stable mcpm output compared to the model 7. AM7164 yield higher outputs of mcpm compared with the GT1M and GT3X+. This would indicate that physical activity levels assessed by the AM7.
This could further affect public health policies and efforts to address a decreasing physical activity level based on methodological challenges rather than true observations. The results from the intra-class correlations show almost perfect agreement. However, perfect agreement is the assumption for researchers using different monitor generations within studies and when comparing results across studies. Several validation studies including these monitors have been done over the last years and the conclusions vary. Most validations are done in mechanical setups or in controlled laboratory settings [1. One explanation for the varying conclusions can be that the results are population specific and that different results can be expected depending on age and activity type [1. However, Reilly et al. ActiGraph accelerometer outputs have little age- or size-related systematic variation for the same behavioral input across a wide age/size range (3–1. Cain et al. [1] state that there is growing evidence of differences in sensitivity of ActiGraph accelerometer outputs among adults, and that it is unclear how model differences affect interpretation of data from children. Our results support the limited cluster of research stating that there is a difference between the old AM7. ActiGraph models, and that these findings might affect interpretation of accelerometer data obtained from children and adolescents [1. Our results also support the growing number of studies showing that data assessed by the newer generation ActiGraphs, from GT1M and forward, can be compared and used interchangeably [1. The observed differences varied in magnitude across intensity levels. The largest differences were seen at the highest intensities, where the children spent the least amount of time (less than 1% of the measured time).
As both the size and the direction of the inter-generation differences were intensity dependent, the absolute difference would depend on the time spent at each intensity and on the cut-points applied. Furthermore, the epoch length also appears to affect the outcome when comparing outcomes of different monitors [1. Some authors have suggested applying a correction factor to data obtained by one of the monitor generations to correct for this difference in mean physical activity (mcpm). Corder et al. [2] suggested multiplying the data derived from the AM7. GT1M. The corresponding correction factor in this study would be to multiply data assessed by the AM7. GT1M data for mcpm, and 0. GT3X+ data for mcpm, based on the relative differences in mcpm of 1. As the inter-generation difference in mcpm varied across intensities, we acknowledge that correcting the mcpm might introduce an unknown bias. We do not know the size of the bias caused by frequency and amplitude. The suggested correction factors would only apply to similar distributions of time spent across intensities. The results of this study might imply that intensity-specific cut-points should be generation specific. However, in order to provide such recommendations, the study needs to be repeated in larger samples. Based on these considerations we did not find it appropriate to suggest intensity-specific correction factors to remedy the demonstrated divergence. However, we urge caution when comparing intensity-specific data assessed by AM7. ActiGraph accelerometers. As the AM7164 was discontinued in the mid 2. However, we worry that future studies will attempt to compare data across studies including data from the AM7. Such comparisons across ActiGraph generations, including the AM7.

Strengths and limitations. The strength of this study was that multiple accelerometers were worn simultaneously by children in a free-living condition. However, there are some limitations.
The sample size was small and we experienced a relatively large drop-out due to incomplete data. However, despite the small sample we observed significant differences between monitors. Furthermore, we did test our hypothesis in a mechanical setup and a second free-living study (n = 2. The main findings that AM7. AM7164 should be treated with caution in comparison with data assessed by the newer generations of ActiGraphs. The study comprises 9-year-old children only, and this hampers the generalizability of the results to other populations such as adults. Furthermore, because multiple settings exist (regarding epoch length, definition of valid days, non-wear time, intensity cut-points, etc.), the findings may not generalize to other settings.
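
As a rough illustration of the mean-level correction factor discussed earlier, the sketch below derives a multiplicative factor from the mean counts per minute (mcpm) of two monitor generations worn at the same time, then applies it to the older monitor's data. The function names and all input values are hypothetical, chosen only to show the arithmetic; they are not figures from this study. As the discussion cautions, a single factor like this would only be reasonable when time spent across intensities is similarly distributed.

```python
from statistics import mean


def correction_factor(reference_mcpm, other_mcpm):
    """Return the factor that rescales other_mcpm so its mean matches reference_mcpm."""
    return mean(reference_mcpm) / mean(other_mcpm)


def apply_correction(values, factor):
    """Multiply every mcpm value by the correction factor."""
    return [v * factor for v in values]


# Hypothetical per-child mcpm values (NOT data from the study):
# the older monitor reads higher than the newer one for the same children.
newer = [540.0, 450.0, 630.0]
older = [600.0, 500.0, 700.0]

factor = correction_factor(newer, older)     # ratio of the two means
corrected = apply_correction(older, factor)  # older data on the newer monitor's scale
print(f"correction factor: {factor:.3f}")
```

Because the factor is a ratio of overall means, it corrects the average level only; it says nothing about differences at specific intensities, which is why the text advises caution there.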