      University of Skövde

      Dissertation: Visualizing Cluster Patterns at Scale: A Model and a Library

      Date: 12 March
      Time: 13:15-17:00
      Location: Insikten, Portalen, or via Zoom

      Elio Ventocilla defends his thesis "Visualizing Cluster Patterns at Scale: A Model and a Library".

      The dissertation is held in Insikten, Portalen, and is also live-streamed on Zoom. Because of current recommendations, attendees are advised to join via Zoom. Note that no more than eight people are recommended to be present in Insikten.

      Click on the link to see the dissertation on Zoom.

      Join the livestream

      Abstract

      Large quantities of data are being collected and analyzed by companies and institutions, with the aim of extracting knowledge and value. When little is known about the data at hand, analysts engage in exploratory data analysis to achieve a better understanding. One approach to doing so is to model and visualize a dataset’s structure, i.e., the neighborhood relations among its data points and their distribution in the multidimensional space. Such a process allows users to disclose and discover neighborhoods, outliers and cluster patterns: insights that enable more informed subsequent analytical decisions.

      Visualizing the structure of multidimensional data (i.e., with four or more dimensions or features) is generally done in two steps: modeling distance or neighborhood relations among data points, and visually encoding those modeled relations. As datasets grow in size (number of data points) and dimensionality (number of features), different scalability challenges arise. High-dimensional datasets, on the one hand, are sparser in the multidimensional space, which makes it harder to make meaningful assessments about distances during modeling and, in turn, reduces the meaningfulness of the visual representations. Large datasets, on the other hand, make it harder to maintain the usability of a system in terms of the effectiveness of the visual representation (due to clutter or overplotting) or the efficiency of the solution (in time and memory). Different approaches have been proposed to overcome these challenges, but they apply to a particular combination of modeling and visual encoding, and their usability still degrades when dealing with very large, potentially distributed, multidimensional datasets.
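
      The difficulty of assessing distances in sparse, high-dimensional data can be illustrated with a small experiment. The sketch below is not taken from the thesis; it assumes uniformly random points and a hypothetical helper, relative_contrast, that measures how much farther the farthest point is than the nearest one from a random query point. As the number of dimensions grows, this contrast tends to shrink, so "near" and "far" become harder to tell apart.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (farthest - nearest) / nearest for one random query point.

    Points are drawn uniformly from the unit hypercube; this is an
    illustrative assumption, not a property of any particular dataset.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # Print the contrast for increasingly high-dimensional data.
    for dims in (2, 10, 100, 1000):
        print(dims, round(relative_contrast(500, dims), 3))
```

      A low contrast means that all points are at roughly the same distance from the query, which is one way to see why distance-based modeling becomes less informative in high dimensions.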

      Moreover, the availability or format of existing Visual Analytics solutions (i.e., solutions that aid data analysis through Machine Learning, visual and interactive techniques) for visualizing data structure—and, hence, cluster patterns—presents an accessibility challenge to the data science community. Namely, many solutions are either unavailable for use, thus requiring their re-implementation, or come as domain-dependent or standalone applications which are too rigid to use in other scenarios, or to integrate with other data analysis tools.

      This thesis addresses these challenges and makes two contributions: a process model, describing a generic approach for the effective and efficient visualization of cluster patterns in large and multidimensional datasets; and an open source library for the interactive visualization of cluster patterns, even in distributed datasets, packaged in an accessible format that allows its integration with other tools within a data analysis environment. The process model suggests sampling and vector quantization to avoid cluttering and overplotting, as well as to improve the efficiency of the system in terms of memory and latency (i.e., the time taken to produce visual feedback from the modeling process and from user interactions). The library instantiates one of the possible configurations of the model, using Apache Spark for distributed computations, the Growing Neural Gas for vector quantization, and Force-Directed Placement for constructing the two-dimensional layout. Seven research publications provide empirical and theoretical grounding for the validity of both the model and the library.
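
      The idea of combining sampling with vector quantization can be sketched in a few lines. The example below is not the thesis library and does not use its API: it runs on a single machine, and it swaps in a plain Lloyd-style (k-means-like) quantizer in place of the Growing Neural Gas to stay short. The names make_blobs, sample and quantize are hypothetical helpers introduced only for illustration. The point is that a visualization then has to place a handful of prototypes (with counts) instead of every data point.

```python
import math
import random


def make_blobs(n, dims, centers, rng):
    """Synthetic data (an assumption): Gaussian blobs around given centers."""
    points = []
    for _ in range(n):
        c = rng.choice(centers)
        points.append(tuple(rng.gauss(c, 1.0) for _ in range(dims)))
    return points


def sample(points, k, rng):
    """Uniform random sample without replacement (at most k points)."""
    return rng.sample(points, min(k, len(points)))


def quantize(points, n_units, n_iters=10, seed=0):
    """Lloyd-style vector quantization (a stand-in for Growing Neural Gas).

    Returns (prototypes, counts), where counts[j] is the number of points
    assigned to prototype j in the last assignment pass.
    """
    rng = random.Random(seed)
    prototypes = [list(p) for p in rng.sample(points, n_units)]
    counts = [0] * n_units
    for _ in range(n_iters):
        sums = [[0.0] * len(points[0]) for _ in range(n_units)]
        counts = [0] * n_units
        for p in points:
            # Assign each point to its nearest prototype.
            j = min(range(n_units), key=lambda i: math.dist(p, prototypes[i]))
            counts[j] += 1
            for d, v in enumerate(p):
                sums[j][d] += v
        for j in range(n_units):
            if counts[j]:  # leave empty units where they are
                prototypes[j] = [s / counts[j] for s in sums[j]]
    return prototypes, counts


if __name__ == "__main__":
    rng = random.Random(1)
    data = make_blobs(10_000, 5, [0.0, 10.0], rng)
    subset = sample(data, 1_000, rng)       # step 1: sampling
    protos, counts = quantize(subset, 20)   # step 2: vector quantization
    print(len(data), "points reduced to", len(protos), "prototypes")
```

      In a full pipeline, the prototypes would then be given two-dimensional positions by a layout step such as Force-Directed Placement; that step is omitted here.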

      Opponent

      Katerina Vrotsou, Senior Lecturer, Linköping University

      Supervisors

      Maria Riveiro, Associate Professor, Jönköping University
      Göran Falkman, Associate Professor, University of Skövde
      Rafael M. Martins, Senior Lecturer, Linnaeus University

      Committee

      Per Backlund, Professor, University of Skövde
      Hans-Jörg Schulz, Associate Professor, Aarhus University, Denmark
      Veronica Sundstedt, Senior Lecturer, Blekinge Institute of Technology
      Yacine Atif, Professor, University of Skövde

      Contact

      PhD Student

      Published: 2/12/2021
      Edited: 2/12/2021
      Responsible: webmaster@his.se