Febrl does data standardisation and probabilistic record linkage of one or more files or data sources.
Record or data linkage techniques are used to link together records that relate to the same entity (e.g. a patient, customer, or household) across one or more data sets, in situations where a unique identifier for each entity is not available in all (or any) of the data sets to be linked.
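As a minimal illustration of the problem (not Febrl code, and with hypothetical field names), two data sources with no shared identifier can still be linked by comparing standardised personal fields:

```python
# Hypothetical illustration: linking two small data sets that share no
# unique identifier, by comparing standardised personal fields.

def normalise(rec):
    """Return a comparison key built from standardised field values."""
    name = " ".join(rec["name"].split()).lower()  # collapse spacing, lowercase
    return (name, rec["dob"])

hospital = [
    {"name": "Jane  Smith ", "dob": "1970-01-02"},
    {"name": "John Doe",     "dob": "1985-06-15"},
]
registry = [
    {"name": "jane smith",   "dob": "1970-01-02"},
    {"name": "Mary Major",   "dob": "1990-12-31"},
]

# Index one source by its comparison key, then probe with the other.
index = {normalise(r): r for r in registry}
links = [(a, index[normalise(a)]) for a in hospital if normalise(a) in index]

print(len(links))  # 1 -- Jane Smith is linked despite differing formatting
```

Real linkage is of course approximate rather than exact-key matching, which is what the probabilistic model described below addresses.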
Record linkage is an important initial step in many research and data mining projects in the biomedical and other sectors, where it is used to improve data quality and to assemble longitudinal or other data sets which would not otherwise be available.
The ANU Data Mining Group is currently working in collaboration with the Centre for Epidemiology and Research at the NSW Department of Health on the improvement of record linkage techniques and software. We are particularly interested in advancing the development of two aspects of record linkage:
- The use of high performance computing (HPC) techniques in order to make linkage of huge data sets faster (or feasible) when using parallel computing platforms, such as clusters of workstations or personal computers, or supercomputers such as the APAC National Facility (a Compaq high-performance platform with 480 processors).
- The use of machine learning and data mining techniques in order to improve upon the accuracy of the linkage and to reduce the very considerable human effort required to link very large data sets.
We have started developing prototype software which undertakes data standardisation, an essential pre-processing phase for most record linkage projects, and which implements the "classical" probabilistic record linkage model as described by Fellegi and Sunter (I. Fellegi and A. Sunter, A theory for record linkage. Journal of the American Statistical Association, 1969) and subsequently extended by others. We hope that this prototype software will be of immediate use to biomedical and other researchers.
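In the Fellegi-Sunter model, each compared field contributes an agreement weight log2(m/u) or a disagreement weight log2((1-m)/(1-u)), where m is the probability that the field agrees for true matches and u the probability it agrees for non-matches; the summed weight is classified against thresholds. A minimal sketch, using illustrative m/u values and thresholds (not Febrl's implementation):

```python
# A minimal sketch of Fellegi-Sunter match weights. The m- and
# u-probabilities and the thresholds below are assumed illustrative
# values; in practice they are estimated from the data.

import math

# Assumed per-field (m, u) probabilities.
params = {
    "surname":    (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "dob":        (0.98, 0.001),
}

def match_weight(agreements):
    """Sum field weights for a dict of field -> agrees? booleans."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total

w = match_weight({"surname": True, "given_name": True, "dob": False})
# Classify against assumed upper and lower thresholds.
status = "link" if w > 10 else "possible link" if w > 0 else "non-link"
```

Record pairs falling between the two thresholds ("possible links") are exactly those that traditionally require manual clerical review.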
We plan to use that software as a platform for exploring various parallel computing and machine learning techniques. To our knowledge, no parallel implementation of probabilistic record linkage is currently available. Issues to be explored include data distribution, blocking techniques, parallel preprocessing and load balancing. Although a number of machine learning and other classification techniques have been applied to the record linkage problem over the last few years, no one has yet focused on using these techniques to reduce or eliminate the time-consuming and tedious manual clerical review process which is needed to decide the status of possible or doubtful links between records.
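Blocking, one of the techniques mentioned above, limits detailed comparison to records that share a cheap blocking key, avoiding the full quadratic cross product. A hedged sketch (the key definition and field names are assumptions, not Febrl's API):

```python
# Illustrative blocking: records are grouped by a cheap blocking key so
# that detailed field comparison is only performed within each block.

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "smith", "postcode": "2600"},
    {"id": 2, "surname": "smyth", "postcode": "2600"},
    {"id": 3, "surname": "jones", "postcode": "2913"},
    {"id": 4, "surname": "smith", "postcode": "2600"},
]

def blocking_key(rec):
    # Assumed key: first two letters of surname plus postcode.
    return rec["surname"][:2] + rec["postcode"]

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

# Only pairs sharing a blocking key become candidate pairs.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2), (1, 4), (2, 4)] rather than all 6 pairs
```

The choice of blocking key trades completeness (true matches split across blocks are missed) against the number of comparisons, which is one reason it matters for parallel data distribution and load balancing.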
The prototype software is published under a free, open source software license in order to promote collaboration and to encourage others to contribute to the development and maintenance of the software. The tools we are using are also all free, open source software, namely the object-oriented programming language Python and associated extension libraries.
To gain an appreciation of the wide range of uses of record linkage in biomedical research, perform a PubMed search for the term "Medical Record Linkage".
What's New in This Release:
- This is another bug fix release for Febrl 0.4. It does not contain new features in the main Febrl program modules, but it does include a much improved new data set generator (generate2) which allows more realistic data generation, enables attribute dependencies, and also facilitates the generation of family and household data.