CPSC 371 Visualization and Data Mining Spring 2007

CPSC 371 Course Information

Pictured: Minard's drawing of Napoleon's march to Moscow; Snow's cholera map; Beck's London Underground map
"...because a picture is worth a thousand words"

Information visualization is an area of computer science concerned with helping users understand data through visual representations. Visualization is a powerful technique: it is possible to pack large quantities of information into a manageable size, and people are generally quite good at detecting patterns, relationships, and anomalies visually.

Data mining involves computational techniques for discovering previously unknown patterns and relationships in large data sets. This is of great interest for companies wishing to analyze customer buying habits in order to increase the company's profit or market share and for governments trying to catch terrorists, to name two examples.

This course will be organized around the theme of discovery and communication of interesting patterns and relationships in data. Topics include the principles and practices of effective visual communication, techniques for exploration and discovery, how visualization and data mining complement each other, and a consideration of the societal impacts and ethics of employing this technology.


Instructor

Stina Bridgeman
bridgeman@hws.edu
Lansing 312, x3614


Office Hours

M 12:30-1:30pm, W 3-4:30pm, R 10:30am-noon, F 10:30-11:30am
or by appointment (schedule)


Class Hours and Meeting Place

Lecture MWF 1:55pm-2:50pm, Lansing 301


Course Web Page

http://math.hws.edu/~bridgeman/courses/371/s07/
You are expected to regularly consult the course web page for announcements, assignments, and most handouts.


Texts

Data Mining, 2nd edition
Ian H. Witten & Eibe Frank
Morgan Kaufmann, 2005

The following books are on reserve at the library:

  • The Visual Display of Quantitative Information, Edward R. Tufte
  • Envisioning Information, Edward R. Tufte
  • Visual Explanations, Edward R. Tufte

Additional material will be handed out or posted on the course webpage.


Prerequisites

C- in CPSC 225, or instructor permission


Rationale & Aims

This course, like the other 300- and 400-level computer science courses, explores a particular topic in computer science. The roots of visualization and data mining, however, are far-flung - including computer graphics, human-computer interaction, cognitive psychology, semiotics, graphic design, cartography, art, algorithms, statistics, artificial intelligence, machine learning, information retrieval, and pattern recognition.

The rationale for this course is straightforward: data is all around us. It is being produced in ever-larger quantities - scientific data sets, medical data sets, customer buying habits, web-surfing habits, credit histories, census data, ...the list goes on. For all this data to be useful, it needs to be examined - for patterns, relationships, anomalies, or anything else that might be interesting. Visualization and data mining provide methods for doing exactly this, which is what makes them relevant.

This course has three primary aims:

  • a survey of techniques, to build a toolbox of visualization and data mining methods (and the knowledge of when each is applicable) which can be used in practical situations

  • understanding of the basic principles and building blocks of visualization and data mining methods, to be able to explain why the specific techniques surveyed are effective and to enable the creation of new techniques when established ones aren't sufficient

  • critical thinking, about the effectiveness of a visualization, the truthfulness of the conclusions drawn from a result, and the social and ethical considerations behind the responsible use of technology


Course Content Overview

There will be three main components to the course: visual communication, exploration, and the social and ethical considerations of the technology.

Visual Communication: The first part of the course will focus on the principles and practices of effective visual communication, including aspects of human perception and cognition, the basic building blocks of visual representations, graphical integrity, and graphical excellence. These principles will be applied as we build a toolbox of visualization techniques, and critique the effectiveness of these techniques for particular tasks.

Exploration: The second part of the course will consider visual and computational methods for the exploration of data sets and the discovery of things of interest. Topics include the basic principles and techniques of building interactive visualizations, applications of those principles and techniques to particular tasks, fundamental data mining tasks and algorithms, applications of those algorithms to particular tasks, and how data mining and visualization complement each other.

Social and Ethical Considerations: Data mining, in particular, raises many concerns about privacy, legality, and ethics - while it also offers many potential benefits. The final part of the course will examine these issues.


Assignments and Evaluation

Online Discussion: Discussion and reflection on the material is an important part of this course. As part of this, you are expected to contribute to the online discussion in the course wiki. This involves both keeping up with what others have posted and posting your own comments. Details can be found on the DiscussionRequirements page in the wiki.

Reading Response: Reading response questions for each reading assignment will be posted on the course wiki, and are due at the start of class on the day for which the reading is assigned. These questions are intended to help you focus on the key points of the reading, to reflect on what you've read, and to prepare you for class discussions. Details can be found on the ReadingResponseRequirements page of the wiki.

Homework: Homework assignments are designed to reinforce topics covered in class, and to allow you to explore concepts on your own. There will be a variety of types of assignments; some will be writing-based (such as critiques of visualizations), while others will be more technical and hands-on (such as using particular tools to explore data sets or construct visualizations).

Project: A major component of the course will be a project in which you will apply the skills you've gained to a real problem. Your role will be that of a consultant working for a client who has a data set and some questions to investigate - you'll interview the client to discover their needs, design and implement a way to answer their questions, and produce a report.

Practicums: In lieu of more typical exams, there will be three practicums (two involving visualization and one involving data mining). In these, you'll be given a data set and a few tasks to address; your job will be to determine how to address those tasks and to produce a report presenting and critiquing your solution. The purpose is to demonstrate that you can apply the theoretical principles studied in the course to practical situations.

Final Grades: Final grades in this course will be computed as follows:

  • Online Discussion: 5%
  • Reading Response: 10%
  • Homework: 35%
  • Final Project: 25%
  • Practicums: 8.3% each (25% total)
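
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The weights come from the list above; everything else is an assumption made for illustration: the names `WEIGHTS` and `final_grade` are hypothetical, each component score is taken to be a percentage from 0 to 100, and the three practicums are averaged equally. How scores within a component are combined, and how borderline grades are rounded, is decided by the instructor and is not modeled here.

```python
# Weights taken from the grade breakdown above, as fractions of the final grade.
WEIGHTS = {
    "online_discussion": 0.05,
    "reading_response": 0.10,
    "homework": 0.35,
    "final_project": 0.25,
    "practicums": 0.25,  # three practicums, assumed to count equally
}


def final_grade(scores, practicum_scores):
    """Return the weighted final percentage (hypothetical helper).

    scores: maps each non-practicum component name to a percentage (0-100).
    practicum_scores: list of the three practicum percentages.
    """
    if len(practicum_scores) != 3:
        raise ValueError("expected exactly three practicum scores")
    components = dict(scores)
    # Assumption: the practicum component is the plain average of the three.
    components["practicums"] = sum(practicum_scores) / 3
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up scores, purely for illustration.
example = final_grade(
    {
        "online_discussion": 90,
        "reading_response": 80,
        "homework": 85,
        "final_project": 88,
    },
    [92, 78, 85],
)
print(round(example, 2))
```

With these made-up scores the components contribute 4.5, 8, 29.75, 22, and 21.25 points respectively, for a weighted total of 85.5.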

Attendance and Participation: On-time attendance and class participation (see the course policies) are expected, though they are not formally factored into your final grade. Missing class - for any reason - often results in lower grades because important material was missed. Similarly, not participating in class even if you are physically present may mean that you aren't actively following the material and thus may be missing more sophisticated or subtle points.
Whether or not your grade is impacted for these reasons is up to you - it is your responsibility to get notes from another student or otherwise catch up on missed material, and this should be done promptly so that you do not fall behind. Also note that class participation and the number of unexcused absences are considered when deciding whether or not to round up a final grade which is just below a grade-level cutoff. See the course policies for the definition of unexcused and excused absences.

Late Policy and Collaboration: See the course policies for the late policy and collaboration policy.

