Abstract: OBJECTIVE: To construct a Portuguese language index of information on the practice of diagnostic radiology in order to improve the standardization of the medical language and terminology. MATERIALS AND METHODS: A total of 61,461 definitive reports were collected from the database of the Radiology Information System at Hospital das Clínicas - Faculdade de Medicina de Ribeirão Preto (RIS/HCFMRP) as follows: 30,000 chest x-ray reports; 27,000 mammography reports; and 4,461 thyroid ultrasonography reports. The text mining technique was applied for the selection of terms, and the ANSI/NISO Z39.19-2005 standard was utilized to construct the index based on a thesaurus structure. The system was created in *html. RESULTS: The text mining resulted in a set of 358,236 (n = 100%) words. Out of this total, 76,347 (n = 21%) terms were selected to form the index. Such terms refer to anatomical pathology description, imaging techniques, equipment, type of study and some other composite terms. The index system was developed with 78,538 *html web pages. CONCLUSION: The utilization of text mining on a radiological reports database has allowed the construction of a lexical system in Portuguese language consistent with the clinical practice in Radiology.

In the medical area, particularly among Radiology and Imaging Diagnosis specialists, the concern with unified terminology systems for clinical practice has, at least in recent decades, become a subject of multidisciplinary basic and applied research(1-3). The pursuit and eventual achievement of harmonized medical information represent a great advance for the area of Health, because they favor better preparation of clinical data records (with greater agility, ease and lower maintenance costs), more accurate administrative management of the stored information, more accurate retrieval of patients' history data and better management of information in both the public and private health segments(1-4).

Medical language is the specialty vocabulary utilized in oral and written communication connected with professional practice(5). Medical terminology, in turn, is the set of terms classified and related to expressions utilized principally in clinical documents such as imaging, laboratory and histopathological reports(1,2,4). Multiple and varied reasons lead to problems in the standardization of such terminology(2). Among the most relevant are the following: the scale and multiplicity of the tasks involving the utilization of medical terminology; the inter-clinic terminological relationship; the inter-area terminological relationship (for example, Medicine versus Nursing, Nutrition, Phonoaudiology, etc.); linguistic problems (pragmatism, neologism, orthography, redundancy, cohesion, lexical polysemy, synonyms, etc.); logic problems (generally failures in structure and narrative density); and ontological problems (how the terms are related to each other in a certain knowledge domain), besides the prospect of using the medical language according to each professional's level of expertise in a health care institution(1-3,5).

The present work introduces and discusses the method for, and the results of, constructing an index(a) of information extracted directly from clinical practice. The index is proposed as a local alternative to international informational standards such as the ACR Index and Radlex, and as a complement to the BI-RADS Atlas. It is also an attempt to minimize the problem of standardization of the medical language and its terminology for the specialty of Radiology and Imaging Diagnosis, particularly in the Portuguese language.

MATERIALS AND METHODS

Three successive development phases were established for the construction of the index. In the first phase, the data were extracted from the reports for the index structuring. For the proof of concept, and considering that it was initially impractical to work with all types of exams in the field of Radiology and Imaging Diagnosis, reports meeting the following criteria were included: best representation of anatomical distribution; informational complexity; and possibility of comparison with other similar studies.

Selection and extraction of data from radiological reports

A total of 61,461 definitive reports were collected from the database of the Radiology Information System at Hospital das Clínicas - Faculdade de Medicina de Ribeirão Preto (RIS/HCFMRP)(1,17-20) as follows: 30,000 chest x-ray reports; 27,000 mammography reports; and 4,461 thyroid nodules ultrasonography (US) reports. The difference in the number of reports per type of exam results from the greater or smaller demand according to the workflow in the Unit of Radiodiagnosis of the institution hosting the research. The selected images were acquired in the period from January 2000 to January 2009. Prior to the start of the research activities, the present study was approved by the Committee for Ethics in Research of the host institution (CEP/HCFMRP) (Process CEP-HCFMRP 10791/2007)(b).

The Oracle *dmp file from the RIS/HCFMRP database had to be converted to a more accessible format in order to meet the needs of the research and to facilitate the researchers' work. Reverse engineering was utilized in this process, transforming the Oracle file into a Microsoft database file with the *mdb extension. Although the Oracle database is equally accessible, the file coming from the RIS/HCFMRP included markings and structures from the proprietary database of the institution's Center of Information and Analysis (CIA-HCFMRP). Thus, the authors opted for converting the file into a different format with the aid of easy-to-run, free database management software(21), solely to retrieve the information from the original file. This choice facilitated the development of the study, reducing the time and, eventually, the costs of implementing a mirror Oracle database dedicated to the research at the host institution.

Each report from the RIS/HCFMRP includes administrative and hospital information in a total of 12 fields to be filled. The administrative information included in the fields "performed on", "disk number", "name of the patient", "clinic code", "study status", "equipment" and "issuer" was anonymized and disregarded for the development of the index. The fields "region", "clinical suspicion", "clinic name", "report conclusion" and "report description" were considered as the source of terms for the index. Once the phase of extraction and collection of reports was completed, the following phase was initiated to select the individual terms for the index.
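The field selection described above can be sketched as follows. This is only a minimal illustration, assuming that each report is available as a dictionary keyed by the field names listed in the text; the sample record and the helper name `extract_index_source` are hypothetical and do not come from the RIS/HCFMRP.

```python
# Fields named in the text as the source of terms for the index.
TERM_SOURCE_FIELDS = (
    "region",
    "clinical suspicion",
    "clinic name",
    "report conclusion",
    "report description",
)

# Administrative fields named in the text as anonymized and disregarded.
DISREGARDED_FIELDS = (
    "performed on",
    "disk number",
    "name of the patient",
    "clinic code",
    "study status",
    "equipment",
    "issuer",
)


def extract_index_source(report):
    """Keep only the five term-source fields of one report (hypothetical helper)."""
    return {field: report.get(field, "") for field in TERM_SOURCE_FIELDS}


# Hypothetical record, invented for illustration only.
sample_report = {
    "performed on": "2005-03-14",
    "name of the patient": "REMOVED",
    "region": "thorax",
    "clinical suspicion": "pneumonia",
    "report description": "opacity in the right lower lobe",
    "report conclusion": "findings consistent with pneumonia",
}

selected = extract_index_source(sample_report)
print(sorted(selected))
```

Missing fields are returned as empty strings in this sketch, so that every report yields the same five keys regardless of how completely it was filled.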

Text mining (contents analysis and categorization process)

The text mining (contents analysis) technique(22) was applied with a commercial, specialized software suite from Provalis Research (SimStat 2.5, WordStat 6.1 and QDA Miner 4)(23), under an academic license for the study. The text mining work consisted of importing the organized tables (similar as regards the origin of the terms) from the reports database into the specialized software, taking each individual type of exam into account. The terminological grouping of the study was designed in two different but complementary stages of the application of the technique: the categorization process and the contents analysis. The categorization process utilized the stop-word removal routine. The following stop-words were excluded from the linguistic corpus: conjunctions, numbers, special characters, unknown words (mostly digitization errors), articles (definite and indefinite) and prepositions.
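The stop-word removal step can be illustrated with a short sketch. It is not the WordStat routine: the stop list below is a hypothetical, deliberately tiny English list, and the detection of unknown words is not attempted.

```python
import re

# Hypothetical, deliberately tiny stop list (articles, prepositions, conjunctions).
# The actual exclusion list used in the study is not given in the text.
STOP_WORDS = {"the", "a", "an", "of", "in", "with", "and", "or", "to"}


def remove_stop_words(text):
    """Lowercase, tokenize on letters only, and drop stop-words.

    Tokenizing on alphabetic runs discards numbers and special characters,
    two of the categories the text lists as excluded.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("Opacity in the right lower lobe, 2 cm, with pleural effusion."))
```

The character class `[^\W\d_]` matches Unicode letters, so accented Portuguese words would be kept whole under the same tokenization.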

The contents analysis(22) was aimed at extracting single terms. Additionally, it sought to identify the most utilized medical terms for each type of exam by correlating six measurements of word frequency in order to obtain a comprehensive keyword list. The following measurements were utilized: frequency (number of occurrences of the word), percentage (based on the total number of words retrieved by means of text mining), total percentage of words (based on the total number of words, except for those removed by the stop-word process), number of cases (number of subjects-report where a word is found) and TF*IDF - term frequency weighted by inverse document frequency.
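Some of these measurements can be sketched in a few lines. The example below is an assumption-laden illustration: it uses one common TF*IDF variant (frequency multiplied by the base-10 logarithm of the number of reports divided by the number of cases), which may differ from the weighting applied by the software used in the study, and the three token lists are invented.

```python
import math
from collections import Counter


def word_measures(documents):
    """Compute per-word measures from a list of token lists (one list per report)."""
    n_docs = len(documents)
    frequency = Counter()   # total occurrences of each word
    cases = Counter()       # number of reports in which each word appears
    for tokens in documents:
        frequency.update(tokens)
        cases.update(set(tokens))
    total_words = sum(frequency.values())
    measures = {}
    for word, freq in frequency.items():
        measures[word] = {
            "frequency": freq,
            "percentage": 100.0 * freq / total_words,
            "cases": cases[word],
            # Assumed TF*IDF variant: tf * log10(N / df).
            "tf_idf": freq * math.log10(n_docs / cases[word]),
        }
    return measures


# Hypothetical tokenized reports, invented for illustration only.
docs = [
    ["nodule", "right", "lobe"],
    ["nodule", "left", "lobe", "nodule"],
    ["cyst", "left", "lobe"],
]
measures = word_measures(docs)
for word in sorted(measures):
    print(word, measures[word])
```

A word that appears in every report, such as "lobe" in the invented data, receives a TF*IDF weight of zero under this variant, which is why the measure helps to separate discriminating terms from ubiquitous ones.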

With the application of the previously described technique, the experiments with the study's linguistic corpus resulted in a list of terms utilized on subjects-report by specialists for each type of exam (chest radiography, mammography and thyroid nodules US). Such words were grouped as follows: single medical terms (the vocabulary actually utilized by specialists); non-single terms (single terms multiplied by the number of their repetitions in the linguistic corpus of the study); and the total of words, representing the sum of single and non-single words. Then, the list of single words was reprocessed with another text mining modality, still focused on contents analysis(22), named keyword in context. This technique delimits the term and the context in which it is found in the document, cutting out and separating a set of one to seven words preceding and following the delimited term, thus forming a semi-complex phrase. This procedure was utilized to retrieve the most common related phrasal structures, or information structures, based on the terms included in the index.
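The keyword in context idea can be sketched as a simple windowing function. This is an illustrative approximation rather than the routine of the software used in the study; the function name and the sample sentence are hypothetical, and the default window of seven words follows the upper limit mentioned in the text.

```python
def keyword_in_context(tokens, keyword, window=7):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    hits = []
    for position, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, position - window):position]
            right = tokens[position + 1:position + 1 + window]
            hits.append((left, token, right))
    return hits


# Hypothetical report fragment, invented for illustration only.
tokens = "calcified nodule in the right lobe of the thyroid".split()

for left, term, right in keyword_in_context(tokens, "nodule", window=2):
    print(" ".join(left + [term] + right))
```

Each hit corresponds to one candidate semi-complex phrase; counting identical phrases across reports would then indicate which structures are most common.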

At the end of the text mining application, a statistical test was performed with the group of single terms found by the contents analysis. The test was aimed at verifying the percentage of single terms in relation to the total number of reports, and the hypothesis that the proportions of single terms found in the three types of exam (mammography, chest radiography and thyroid nodules US) are indeed different. For the percentage, the ratio of "single terms" to "total of subjects-report", expressed per hundred, was calculated for each type of exam. The proportion hypothesis was tested by means of a parametric chi-square test(21). The level of significance corresponded to 5% (p < 0.05).
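A chi-square test of proportions of this kind can be sketched with the standard library alone. The layout of the contingency table (one row per type of exam, with columns for single terms and remaining words) is an assumption, and the counts are invented, since Tables 2 and 3 are not reproduced here. For a table with three rows and two columns there are two degrees of freedom, in which case the p-value has the closed form exp(-x/2).

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_two_df(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)


# Hypothetical counts (single terms, remaining words) per type of exam.
table = [
    [50, 950],   # chest radiography (invented)
    [30, 970],   # mammography (invented)
    [20, 980],   # thyroid nodules US (invented)
]

statistic = chi_square_statistic(table)
p_value = p_value_two_df(statistic)
print(f"chi-square = {statistic:.3f}, p = {p_value:.5f}, significant at 5%: {p_value < 0.05}")
```

With real data, a p-value below 0.05 would support the hypothesis that the proportions of single terms differ among the three types of exam.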

Index construction

The index architecture was based on a controlled vocabulary organized as a knowledge representation system called a thesaurus. For such a purpose, the authors selected a standard with international coverage and an interdisciplinary approach, up to date and compatible with the operational reality of health information systems. As a complement, the standard had to be grounded on the theories of Faceted Classification, Concept and Terminology. The standard which met these requirements, and was therefore utilized in the present study, was the American National Standard/National Information Standards Organization - Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies Z39.19-2005 (ANSI/NISO Z39.19-2005)(19). The relationships supported by the index structure are shown on Table 1.
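Controlled vocabularies built under ANSI/NISO Z39.19-2005 conventionally distinguish equivalence (USE/UF), hierarchical (BT/NT) and associative (RT) relationships, each recorded reciprocally. The sketch below shows one possible in-memory representation; it is not the data model of the software used in the study, the class is hypothetical, and the example terms are invented rather than taken from Table 1.

```python
from collections import defaultdict


class Thesaurus:
    """Minimal thesaurus holding reciprocal USE/UF, BT/NT and RT links."""

    def __init__(self):
        # term -> relationship tag -> set of related terms
        self.relations = defaultdict(lambda: defaultdict(set))

    def _link(self, source, tag, target):
        self.relations[source][tag].add(target)

    def add_equivalence(self, preferred, non_preferred):
        # Equivalence: the non-preferred term points to the preferred one.
        self._link(non_preferred, "USE", preferred)
        self._link(preferred, "UF", non_preferred)

    def add_hierarchy(self, broader, narrower):
        # Hierarchy: broader term (BT) and narrower term (NT).
        self._link(narrower, "BT", broader)
        self._link(broader, "NT", narrower)

    def add_association(self, term_a, term_b):
        # Association: related terms (RT), symmetric.
        self._link(term_a, "RT", term_b)
        self._link(term_b, "RT", term_a)

    def related(self, term, tag):
        return sorted(self.relations[term][tag])


# Invented example terms, for illustration only.
thesaurus = Thesaurus()
thesaurus.add_equivalence("tendinitis", "tendonitis")
thesaurus.add_hierarchy("soft tissue inflammation", "tendinitis")
thesaurus.add_association("tendinitis", "calcification")

print(thesaurus.related("tendonitis", "USE"))
print(thesaurus.related("tendinitis", "BT"))
print(thesaurus.related("calcification", "RT"))
```

Recording both directions of each link at insertion time keeps the structure consistent, so a term page can list its synonyms, broader terms, narrower terms and related terms without a separate lookup pass.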

Because of the amount of terms and words in the study, a semi-automatic software [MultiTes Pro, academic license, version (usage authorization) 2008/2009 and 2010/2011] was utilized for the index construction. Such software not only performs the linkage of terms, but also generates an initial structure in web language as a formalization of the accomplished work. The native option of MultiTes Pro utilized for the structure output was the HyperText Markup Language, *html. Among other reasons, this language was selected because it is the root format generated by MultiTes Pro. After the index was fed with the terms and words and subsequently generated in a web-browsable system format, the screens underwent a last modeling process to be used in a teaching hospital. The developed tool was characterized on the basis of a comparative analysis of requisites, functionality and scope between the locally developed index and the most relevant lexicons in the area, namely the ACR Index, BI-RADS Atlas and Radlex.
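To give a rough picture of what a web-browsable structure with one *html page per term may look like, the sketch below generates a minimal page for a term and links it to related entries. The page layout, the file naming scheme and the helper names are all hypothetical; the actual output of MultiTes Pro is not described in the text.

```python
import re
from html import escape


def slug(term):
    """Turn a term into a file-name-safe slug (simplification: ASCII only)."""
    return re.sub(r"[^a-z0-9]+", "-", term.lower()).strip("-")


def term_page(term, related_terms):
    """Return a minimal HTML page for one index term (hypothetical layout)."""
    items = "\n".join(
        f'  <li><a href="{slug(name)}.html">{escape(name)}</a></li>'
        for name in related_terms
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head><title>{escape(term)}</title></head>\n"
        f"<body>\n<h1>{escape(term)}</h1>\n<ul>\n{items}\n</ul>\n</body>\n</html>\n"
    )


# Invented entries, for illustration only.
entries = {
    "tendinitis": ["calcific tendinitis", "tendinitis of the shoulder"],
}

pages = {f"{slug(term)}.html": term_page(term, related) for term, related in entries.items()}
print(sorted(pages))
```

The ASCII-only slug is a simplification; accented Portuguese terms would need transliteration or percent-encoding before being used as file names.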

RESULTS

The text mining as a first application of terminological extraction resulted in 11,210,832 (n = 100%) words, constituting a semantic complex utilized in the 61,461 reports included in the present study. After selection and routine stopwords removal, a total of 358,236 (n = 3.19%) words was reached. With the application of the text mining techniques in order to delimit the specificity of the words, 24,488 (n = 0.21%) single terms could be selected. The single terms, with the application of the context word routine, made 51,859 (n = 0.46%) information structures (semi-complex phrases) available. The index construction was completed with 76,347 (n = 0.68%) words and semi-complex phrases (vocabulary terms). Such terms are divided into descriptive anatomic terms of pathology, of imaging technique, of equipment, of type of exam and composite terms (and, therefore, repeated in the index), considering the addition of another specialty term and/or word grammatically necessary for its understanding (connective, for example: of, with, for, etc.). The index system was developed with 78,538 fully-browsable web *html pages and with possibility of terms linkage with the operational system of reports or with any eventually created system supporting a compatible language, including 2,191 pages for terms input and navigation structure (initial and feedback screens) and 76,347 pages dedicated to each individual term.
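The figures reported above can be rechecked with integer arithmetic. The sketch below takes the counts directly from the text; the only assumption is that the percentages were truncated, rather than rounded, to two decimal places, which is what reproduces the reported 3.19% and 0.21% (rounding would give 3.20% and 0.22%).

```python
TOTAL_WORDS = 11_210_832  # words retrieved from the 61,461 reports

counts = {
    "after stop-word removal": 358_236,
    "single terms": 24_488,
    "information structures": 51_859,
    "index vocabulary terms": 76_347,
}


def truncated_percentage(part, whole):
    """Percentage truncated (not rounded) to two decimals, via integer math."""
    hundredths = part * 10_000 // whole
    return hundredths / 100


for label, count in counts.items():
    print(f"{label}: {truncated_percentage(count, TOTAL_WORDS):.2f}%")

# Totals stated in the text.
index_terms = counts["single terms"] + counts["information structures"]
total_pages = 2_191 + index_terms
print(index_terms, total_pages)
```

The same check confirms that the 76,347 vocabulary terms are the sum of single terms and information structures, and that the 78,538 pages are the sum of the 2,191 navigation pages and one page per term.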

The chi-square test revealed that the words distribution proportion for each different type of exam (Table 2) and the percentage of single terms per type of report presented asymmetry for the exams included in the present study (Table 3). The comparative analysis of requisites, functionality and scope, between the terminology of the area and the presently described index, revealed a considerable degree of appropriateness as regards the characteristics of such international tools, as demonstrated by Table 4.

DISCUSSION

Text mining application

The contents analysis process(22) may be described as a systematic, objective and quantitative method for analyzing the characteristics of a message. The most relevant advantages of contents analysis correspond to the direct observation of the validity, interpretation and explanation of how such information is formalized in a given data set over a long time span(22). Furthermore, it allows the analysis of information produced by a heterogeneous community of specialists (which may include professionals from areas other than Medicine). In the present study, the RIS information is produced by resident physicians, contracted specialists and medical professors. Differently from studies more focused on the computational or methodological and conceptual aspects of text mining, the fundamental view of the present study on the application of the technique, as indicated by the results (Tables 2 and 3), is that it allows the establishment of possible terminological categories for clinical and pedagogical use and may eventually also facilitate the development of software in the Portuguese language compatible with the radiological practice in Brazil.

From the point of view of developing a semantic corpus that could serve as a basis for a medical vocabulary standard, the results presented on Tables 2 and 3 show different terminological distributions for each type of report. Such heterogeneity hinders the development of a single model for the extraction and formation of medical dictionaries or of automatic and comprehensive descriptive diagnostic standards, and requires an individualized observation of each type of exam and technique in order to create a terminological extraction design. It is important to highlight that the results presented as a whole on Tables 2 and 3 confirm that the utilization of a diagnostic standard, such as BI-RADS, clearly provides a decrease in the utilization of different terms for a single type of report.

The problematic of use of information

Firstly, the results of the present study demonstrate that a RIS database, in general longitudinal and representative of a unit of Radiology and Imaging Diagnosis, may serve as a tool for constructing knowledge-intensive systems. Although this latter assertion is obvious to specialists in Computer Sciences and in Engineering in general, such a finding about a RIS database is opportune and necessary for the present study, as it alerts the community of radiologists to the scientific, and even corporate and financial, potential present in the databases of each Brazilian clinic and hospital. Such databases may even serve as a substitute for conventional lexical systems which, almost in their totality, are written in a foreign language. This aspect may favor the process of teaching the correct use of the specialty terminology, with direct repercussion on the radiologists' product, i.e., the radiological report, and on the communication among medical specialists. Additionally, they allow for a transposition of lexical systems and their utilization in patient care in a less mechanical and more flexible manner, since the information coming from RIS databases for clinical use represents daily professional practice (guarantee of use). In terms of harmonization of the use of medical terminology, the guarantee of use allows us to cope with two problems, namely, the scale and the multiplicity of tasks involving the use of medical terminology. The usage scale is an "X" amount of information produced by a "Y" number of individuals. Multiplicity of tasks, on the other hand, is an "X" amount of information produced for a "Y" number of distinct objectives and tasks. These two conditions, in which a great deal of information is produced by a high number of individuals for several objectives, either complementary or not, favor the emergence of the information inconsistency problems mentioned in the introduction of the present article.
The text mining method utilized in the present study allows for the establishment of limits on the variability of terms utilized in reports, reducing the occurrence of disparities in the narrative accuracy of the text and of developments connected with the usage scale and multiplicity of tasks. It is then possible to reduce the terminological set available in the index to a minimum. Thus, the reductions resulting from the described method allow for a usage scale and a multiplicity of radiological information tasks based on an environment that is controlled and stable in relation to the applied terminology, favoring a decrease in the occurrence of linguistic problems(1).

Comparison of the index with the ACR Index, BI-RADS Atlas and Radlex

The ACR Index (Table 2) is a system organized with terms originating from anatomy and pathology, potentially utilized by radiologists in the description of radiological findings. In such a system, the terms receive a code (two to four digits for anatomy terms, and two to five digits for pathology terms) separated by a dot delimiting their origin, firstly indicating location (anatomy) and subsequently the lesion or condition (pathology)(6,20,24). Such coding allows a set of up to ten digits to formalize an informational reference. This standardization, based on a decimal classification system, serves above all to retrieve information, differently from a proper terminology, which offers complex semantic relationships and may be useful in the modeling of electronic information systems, with inference trees and descriptive dictionaries. Both the ACR Index and the BI-RADS Atlas present a limited set of terms, without complex and organized relationships, in order to facilitate the utilization by the user of a singular term and not a set of terms. The use of the ACR Index provides an example. For a reference to the term "calcific tendinitis of the supraspinatus muscle" (Figure 2), it is necessary to consult the skeletal system (shoulder girdle and arm) and select the most proximal anatomical area which, in this case, is the shoulder joint, whose code is 414.; the next step is to combine it with the pathology which, on Figure 2, is calcific tendinitis, described in the group of periarticular and articular soft tissue inflammation, code .253 in the ACR Index. Such a group includes only the option "tendinitis" and its typification as calcified and Pellegrini-Stieda syndrome. Thus, according to the logic of the ACR Index, calcific tendinitis of the supraspinatus muscle (Figure 2) would receive the diagnostic code 414.253.
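The composition of the code in this example can be sketched as follows. The function is a hypothetical illustration of the anatomy-dot-pathology pattern and of the digit ranges stated in the text; it does not consult the actual ACR Index tables.

```python
def compose_acr_code(anatomy_code, pathology_code):
    """Join an anatomy code such as '414.' and a pathology code such as '.253'."""
    anatomy = anatomy_code.rstrip(".")
    pathology = pathology_code.lstrip(".")
    if not (anatomy.isdigit() and pathology.isdigit()):
        raise ValueError("both parts must be numeric")
    # Digit ranges as stated in the text: 2-4 for anatomy, 2-5 for pathology.
    if not 2 <= len(anatomy) <= 4:
        raise ValueError("anatomy code must have two to four digits")
    if not 2 <= len(pathology) <= 5:
        raise ValueError("pathology code must have two to five digits")
    return f"{anatomy}.{pathology}"


def split_acr_code(code):
    """Split a composed code back into its anatomy and pathology parts."""
    anatomy, pathology = code.split(".", 1)
    return anatomy, pathology


# Example from the text: shoulder joint (414.) with calcific tendinitis (.253).
code = compose_acr_code("414.", ".253")
print(code)
print(split_acr_code(code))
```

The sketch makes the limitation discussed in the text visible: the composed code identifies a location and a condition, but carries none of the descriptive wording of the original finding.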
As compared with such a model, the index developed and described in the present study allows for a different approach, with differentiated characteristics for clinical use. The search for a term is simplified, since the user just needs to access the alphabetical index (Figure 1) and select the letter "t" to access the term "tendinitis". Under the letter "t", the user finds the term "tendinitis" as a single term together with all the phrasal formations (structures) utilized by specialists. With two clicks it is possible to retrieve any information related to the term "tendinitis" classified in the index. Another improvement is related to the use of decimal classification: the coding utilized by the ACR Index, besides being difficult for users to assimilate, causes a terminological reduction of the description.
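The two-step lookup through the alphabetical index can be sketched as below. The entries are invented, the function names are hypothetical, and the sketch makes the simplifying assumption that the phrasal structures of a term begin with that term.

```python
from collections import defaultdict


def build_alphabetical_index(entries):
    """Group index entries under their initial letter, in alphabetical order."""
    by_letter = defaultdict(list)
    for entry in sorted(entries):
        by_letter[entry[0].lower()].append(entry)
    return by_letter


def lookup(by_letter, term):
    """First step: pick the letter. Second step: pick the term and its phrases."""
    letter_entries = by_letter.get(term[0].lower(), [])
    return [e for e in letter_entries if e == term or e.startswith(term + " ")]


# Invented entries, for illustration only.
entries = [
    "thorax",
    "tendinitis of the supraspinatus muscle",
    "calcification",
    "tendinitis",
]

alphabetical_index = build_alphabetical_index(entries)
print(lookup(alphabetical_index, "tendinitis"))
```

Under these assumptions the user reaches both the single term and its phrasal structures without having to compose or decode a numeric code.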

Figure 1. Initial index screen, with the terms entry index itself and general information on the system.

The BI-RADS Atlas (Table 2), in turn, is a system specific for the standardization of the description and conclusion of mammographic reports(7,25). It is widely utilized in the practice of Radiology and Imaging Diagnosis, as well as by correlated areas involved in diagnostic investigation. Its functioning model, which emphasizes the standardization of findings, descriptive terms and possible conclusions, serves as a basis for other similar initiatives as regards gains in the quality of the information and, consequently, in the quality of the diagnosis(7). Among the lexical systems discussed in the present study, BI-RADS is the only one that allows simultaneous utilization with the developed index, since it is utilized in mammography reporting and is included in the set of terms of the index.

On the other hand, Radlex (Table 2) is the most recent among the three mentioned systems. Its development was initiated in the middle of the last decade as a response to the limitations imposed by the coded classification of the ACR Index(8,27). The Radiological Society of North America (RSNA) proposed the expansion and terminological review of the ACR Index tree. For this purpose, the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT®)(8) was utilized. With the terminology review, expansion and re-design into a new structure, Radlex allows for the use of information related to devices, procedures and imaging techniques in Radiology and Imaging Diagnosis, as well as descriptive terms on the difficulty in the perception and analysis of interpretation and diagnostic quality of images(8,27). From the technological point of view, Radlex allows the manipulation of the system information in several ways. One of such ways is the possibility of exporting its structure to the Open Document Format (*odf) by means of the extensible markup language (*xml) and the resource description framework (*rdf), since both of them are included in the syntax of the *owl language. The *odf extension is a file format readable by several computer programs, among them Google Docs, IBM® Lotus® Symphony™ and OpenOffice.org. Considering that it is designed as an ontology, it also allows for the automatic and scalable interoperability of health information systems within the terminology coverage domain. It is the lexical system most similar to the presently described index. The difference is that the terms included in Radlex are units, without the composition of phrasal structures, which tends to limit its use in references with greater terminological complexity.
The reference tree of Radlex may also present some obstacles to access, since it separates descriptive terms of anatomy and pathology into their own terminological niches, differently from the constructed index. Considering that it is developed as an ontology model, which is a knowledge representation system more adaptable than the thesaurus, Radlex makes it possible to import just the parts of interest from its terminological structure, for example, a given anatomical area or a set of pathologies of the same type. This is not possible with the index: information cannot be exported from its structure, and only terms can be imported, increasing its database with new terms. On the other hand, the developed index presents the great advantage of allowing a direct relationship between terms and the clinical practice of reporting in the Portuguese language.

Study limitations and future challenges

The choice of a thesaurus to organize the information collected from reports has shown to be partially effective in dealing with logic problems related to the use of information. The semi-complex phrases (up to seven words) articulated with terms included in the database and constructed with text mining techniques of contents analysis (keyword and keyword in context) are, in truth, structures of information. However, it is not possible to construct completely structured reports by simply utilizing the index and such phrases without implementing additional automation technologies involving programming resources and studies of the current healthcare communication protocols. Additionally, the validation of a structured radiological report demands the construction of a specific tool and a comprehensive group of specialized observers available to complete such a task(30). Therefore, one cannot assure that the method and the index of the present study serve the purpose of solving and/or attenuating problems related to structural failures and narrative density. One may consider, however, that the present study describes viable methods to achieve such a purpose.

As regards ontological problems, the formalization of relationships like those utilized in the present study (Table 1) establishes a versatile form of reference/use of the semantic set displayed on the index, because it comprehensively considers three different types of relationships and their logical developments (Table 1). The equivalence relationship allows the radiologist to use or refer to the best term among a set of synonyms. The hierarchical relationship allows for the selection of a term in a category or closed group of terms (a set of terms related to a given anatomical region, pathology or other formalizations of groups and/or subgroups of terms). The associative relationship allows terms from areas that are different and/or distant as regards hierarchy and equivalence (antonyms) to be organized according to, for example, diagnostic associations (cause-effect relationship) (Table 1). However, even considering the versatility in the construction of relationships, a thesaurus does not favor the establishment of relationships automatically processable by machines(1,9). An ontology is an appropriate knowledge representation system which allows for computer processing(18) and also establishes robust inference rules for the medical practice(31-34). It is the model on which Radlex, for example, was developed.

An ontology(27) is also the most indicated knowledge representation system to achieve semantic interoperability between healthcare information systems(35,36). It offers a possible solution for problems connected with inter-clinic communication in Medicine and between health areas, allowing heterogeneous systems to communicate in multiple ways and with formalization of complex information. This does not occur with the developed index, since its structure cannot support a communication protocol between electronic healthcare systems. On the other hand, the established informational organization(37-40) should facilitate the construction of an ontology model in the Portuguese language compatible with the clinical practice.

Despite the technological limitations associated with the index model utilized in the present study, the prospect of developing information tools which may help radiologists, particularly the Brazilian community of radiologists, in a friendly and transparent manner in their daily clinical practice seems promising. The evidence supporting this assertion is that an ontology is an advanced computational instrument which can be developed from a thesaurus (that is, an index structure). Additionally, information technology apparatuses represent the second largest category of research & development in the area of Radiology and Imaging Diagnosis, promoting technological innovation at both national and international levels(41). The index developed in the present study is an intermediate tool which may serve as a basis for a series of applications with repercussions on education, research and patient care, with potential utilization in the modeling of innovative technological inputs, particularly for the Brazilian reality(1,9,12,36,42-44).

CONCLUSION

The study presented a method for information extraction from radiological reports, allowing the construction of a terminological system in the Portuguese language grounded in the practice of Radiology and Imaging Diagnosis.

Acknowledgements

To Fundação de Amparo à Pesquisa do Estado de São Paulo (Fapesp), Fundação de Apoio ao Ensino, Pesquisa e Assistência do Hospital das Clínicas de Ribeirão Preto (Faepa-HCFMRP) and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), for the financial support. To Prof. Dr. Jorge Elias Júnior, for his collaboration regarding peculiarities and diagnostic descriptive differences between chest radiography and thyroid ultrasonography reports; to Dr. Normand Péladeau, of Provalis Research, for his assistance in understanding the utilization of the data mining tool, and also for the valuable indication of bibliography; to CKE applications, for the web design work.

Study developed at Faculdade de Medicina de Ribeirão Preto da Universidade de São Paulo (FMRP-USP), Ribeirão Preto, SP, Brazil.
(a) Index is a comprehensive term, both in Portuguese and in English, meaning(15,16) "index of selected information essentially serving the purpose of allowing or facilitating the retrieval of any type of record of knowledge either by physical or electronic means".
(b) The present study was planned and conducted in compliance with the research integrity guidelines included in the "Code of Good Scientific Practices"(17) for Fundação de Amparo à Pesquisa do Estado de São Paulo (Fapesp) beneficiaries and scholarship holders.