Data Mining Program
The Data Mining (DM) Program carried out research to improve techniques
for managing and interpreting large and complex data sets. Organisations
increasingly use data mining to analyse their data to solve such
business problems as risk management, market segmentation and fraud
detection. Data mining allows for a better understanding of market
and client behaviour, and can often lead to gains in competitive
advantage.
Data mining requires specialist expertise from disciplines such
as statistics, machine learning and visualisation. Algorithms must
be developed for automating data sorting and verification tasks,
for visualising and exploring data sets, and for identifying patterns
and anomalous groups in data sets. To bring focus to these activities,
the DM Program set up three projects: NumAlg, ParAlg and DMITL.
NumAlg was the precursor to ParAlg; both aimed to develop
state-of-the-art vector and parallel algorithms and implement them in
high-quality, portable code. These algorithms made efficient
use of parallel computers and helped identify non-conformities in
large data sets. The DMITL Project centred on case studies and consultancies.
DMITL researchers collaborated with ACSys participants who provided
the problems, data, domain expertise and final assessment of results.
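As a rough illustration of what identifying non-conformities in a data set can involve, the sketch below fits a simple model and flags records that deviate unusually far from it. It is a minimal example under stated assumptions, not the Program's own algorithms: the straight-line model, the function names `fit_line` and `flag_nonconformities`, and the threshold of 3.5 are all hypothetical choices made for illustration.

```python
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ a + b*x (closed form)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def flag_nonconformities(xs, ys, threshold=3.5):
    """Return indices of records whose residual from the fitted line is unusually large.

    Residuals are scored with a robust z-score based on the median and the
    median absolute deviation (MAD). The threshold is an assumed default.
    """
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        return []  # no spread in the residuals, nothing to compare against
    # 0.6745 rescales the MAD so scores are comparable to standard deviations
    # when the residuals are roughly normally distributed.
    scores = [0.6745 * (r - med) / mad for r in residuals]
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Hypothetical data: points on the line y = 2x + 1, with one corrupted record.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 50
print(flag_nonconformities(xs, ys))  # [5]
```

The median-based score is used here because a single extreme record distorts the mean and standard deviation much more than it distorts the median. Spline-based models would replace the straight-line fit with a far more flexible surface.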
Program Highlights
Developed a parallel data mining algorithm, BMARS (B-spline Multivariate
Adaptive Regression Splines), an efficient and robust extension
of Friedman's MARS algorithm. The algorithm was applied to
the identification of local non-conformities in large data sets supplied
by the Australian Tax Office.
Developed a scalable and parallel discrete thin plate spline algorithm,
TPSFEM (Thin Plate Spline Finite Element Modelling). The algorithm
was applied to very large data sets associated with geological mapping
and fluid flow data sets.
Developed the PLASM (Probing Least Absolute Squares Modelling)
algorithm for variable selection in generalised additive models.
PLASM was applied to an NRMA data set.
Developed an Additive Model Workbench to allow
rapid discovery of unusual data patterns. The workbench was applied
to a data set supplied by the NRMA.
Developed feature extraction methods for time series data.
The methods were applied to identifying stars in the MACHO (Massive
Compact Halo Objects) data set, which classifies stars by determining
light intensity at specific wavelengths.
Developed demonstrators using specialised techniques such as
GAMMING (Generalised Additive Models) and scalable thin-plate spline
smoothing. Magnify has a licence to develop and use GAMMING software.
Conducted successful consultancies and case studies for the Australian
Tax Office, Health Insurance Commission, NRMA, Medicare, and ACT
Electricity and Water.
Awarded the High Performance Computing Challenge Award for the Terabyte
Challenge Demonstration of Distributed Data Mining at Supercomputing 98.