Contact

Simon Hawkins
CSIRO Mathematical and Information Sciences
GPO Box 664
Canberra ACT 2601
Phone: 02 6216 7010
simon.hawkins@cmis.csiro.au

 


Data Mining Program

The Data Mining (DM) Program carried out research to improve techniques for managing and interpreting large and complex data sets. Organisations increasingly use data mining to analyse their data to solve such business problems as risk management, market segmentation and fraud detection. Data mining allows for a better understanding of market and client behaviour, and can often lead to gains in competitive advantage.

Data mining requires specialist expertise from disciplines such as statistics, machine learning and visualisation. Algorithms must be developed for automating data sorting and verification tasks, for visualising and exploring data sets, and for identifying patterns and anomalous groups in data sets. To bring focus to these activities, the DM Program set up three projects:

  • Numerical Algorithms (NumAlg)

  • Parallel Algorithms (ParAlg)

  • Data Mining in The Large (DMITL)


NumAlg was the precursor to ParAlg; both projects aimed to develop state-of-the-art vector and parallel algorithms and implement them in high-quality, portable code. These algorithms made efficient use of parallel computers and helped identify nonconformities in large data sets. The DMITL Project centred on case studies and consultancies. DMITL researchers collaborated with ACSys participants, who provided the problems, data, domain expertise and final assessment of results.


Program Highlights

Developed a parallel data mining algorithm, BMARS (B-splines Multivariate Adaptive Regression Splines), an efficient and robust extension of Friedman’s MARS algorithm. The algorithm was applied to the identification of local nonconformities in large data sets supplied by the Australian Tax Office.


Developed a scalable, parallel discrete thin plate spline algorithm, TPSFEM (Thin Plate Spline Finite Element Modelling). The algorithm was applied to very large data sets associated with geological mapping and fluid flow.

Developed the PLASM (Probing Least Absolute Squares Modelling) algorithm for variable selection in generalised additive models. PLASM was applied to an NRMA data set.


Developed an ‘Additive Model Workbench’ to allow rapid discovery of unusual data patterns. The workbench was applied to a data set supplied by the NRMA.


Developed feature extraction methods for time series data. The methods were applied to identifying stars in the MACHO (Massive Compact Halo Objects) data set, which classifies stars by determining light intensity at specific wavelengths.


Developed demonstrators using specialised techniques such as GAMMING (Generalised Additive Models) and scalable thin plate spline smoothing. Magnify has a licence to develop and use the GAMMING software.


Conducted successful consultancies and case studies for the Australian Tax Office, Health Insurance Commission, NRMA, Medicare, and ACT Electricity and Water.


Awarded the High Performance Computing Challenge Award for the Terabyte Challenge Demonstration of Distributed Data Mining at Supercomputing 98.