Scaling Information Retrieval
In the old days, all corporate documents had to be printed in duplicate and filed in a paper-based system of dozens, if not hundreds, of cabinets and shelves.
But even with the most diligent organizing, files in this system frequently were lost, misplaced or corrupted. Many thought that with the worldwide transition to e-business, all communications and documents would be stored electronically on massive corporate servers, thus solving the problem.
As the amount of information companies produce grows exponentially, however, this ideal storage system is being put to the test.
Traditional knowledge management solutions, which generally search stored data using keywords the user identifies, are most effective when dealing with small quantities of data or searching for generalized information. When someone needs to quickly identify a specific document in a massive pool of data, this method is highly inefficient.
Newer semantic Web technologies address this challenge by looking at related data within a document to determine which data sets fit a specific user’s unique needs.
Natural language processing (NLP) techniques can also interpret searches phrased as direct questions and separate search results according to meaning when keywords have more than one use. For instance, these technologies would be able to distinguish whether a user searching for mice was interested in rodents or computing accessories by the way he or she phrased the query.
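To make the mice example concrete, here is a minimal sketch in Python of how a system might guess the intended meaning from the other words in a query. The cue-word lists and the function name guess_sense are hypothetical and chosen only for illustration; production NLP systems typically learn such associations from large amounts of text rather than from hand-written lists.

```python
import re

# Hypothetical cue words for two senses of "mice" (illustrative only).
SENSE_CUES = {
    "rodent": {"pet", "cage", "trap", "feed", "pest", "rodent"},
    "computer accessory": {"wireless", "usb", "keyboard", "click", "bluetooth", "computer"},
}


def guess_sense(query: str) -> str:
    """Return the sense whose cue words overlap most with the query.

    Returns "ambiguous" when no cue word matches or when the two
    senses tie, so the caller can fall back to showing both.
    """
    words = set(re.findall(r"[a-z]+", query.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return "ambiguous"
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(guess_sense("Which wireless mice work with a USB keyboard?"))
    print(guess_sense("How do I trap mice in my house?"))
    print(guess_sense("mice"))
```

A query with no surrounding context, such as the single word "mice," is reported as ambiguous instead of being forced into one meaning.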
Scaling information-retrieval methods to accommodate both small to midsized businesses (SMBs) and enterprise operations requires storage systems that use a combination of traditional knowledge management and semantic Web technologies, said Jianhan Zhu, an enterprise and expert search researcher at the Open University in the United Kingdom.
“Web indexing and searching technologies like those used by Google can scale very well to terabytes of dataset,” he said. “On the other hand, traditional knowledge management and natural language-processing techniques can help improve the effectiveness of data retrieval. Therefore, we need to integrate Web indexing and searching technologies with domain knowledge and NLP techniques. A balance of the two strands of technology will lead to a both scalable and effective data-management and -retrieval solution.”
The security issues involved with scaling information retrieval for enterprise operations are similar to those any Web search engine faces, Zhu said. Domain-specific crawling policies, as well as content-filtering and anti-spamming techniques, can help keep an organization’s stored information secure.
Domain knowledge about the enterprise also can improve the organization’s storage and search solutions. Creating multiple levels of control and specifying user rights by individual ID can improve the security of the system, as well, he said.
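The idea of multiple control levels combined with rights tied to an individual ID can be sketched as a filter applied to search results before they are shown. This is only an illustration under stated assumptions: the Document fields, the USER_LEVELS table and the visible_results function are hypothetical names, not part of any system Zhu describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    min_level: int                          # lowest clearance level allowed to read it
    allowed_users: frozenset = frozenset()  # individual IDs granted access regardless of level


# Hypothetical clearance levels keyed by user ID.
USER_LEVELS = {"alice": 3, "bob": 1}


def visible_results(user_id: str, results: list) -> list:
    """Keep only the documents this user is entitled to see.

    Unknown users get level 0, so they see only unrestricted documents.
    """
    level = USER_LEVELS.get(user_id, 0)
    return [
        doc for doc in results
        if level >= doc.min_level or user_id in doc.allowed_users
    ]


if __name__ == "__main__":
    hits = [
        Document("Cafeteria menu", 0),
        Document("Salary bands", 3),
        Document("Project plan", 2, frozenset({"bob"})),
    ]
    for user in ("alice", "bob", "eve"):
        print(user, [doc.title for doc in visible_results(user, hits)])
```

Filtering after retrieval keeps the index shared across all users while still ensuring that restricted titles never appear in a result list for someone without the right level or an explicit grant.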
The main challenge in scaling storage systems involves integrating traditional and modern technologies efficiently and effectively. Given the storage community’s interest and dedication, enterprise information-retrieval solutions soon will be able to adequately address the growing needs of enterprises, Zhu said.
“An effective and efficient retrieval system based on modern information and retrieval technologies, domain knowledge and NLP techniques should be able to adapt well to larger companies, as well as SMBs,” he said. “Currently, enterprise search has attracted lots of attention from both industrial and academic communities. Given such effort, we can successfully tackle the challenges and make retrieval systems scalable to companies of all sizes.”