Communications of the IBIMA
Corporate Technology Intelligence Research System through Recycling
Public Patent Databases
Jose Aldo Diaz-Prado, Arturo Lopez-Pineda and Marco Polo Cruz-Ramos
ITESM Campus Monterrey, Monterrey
Volume 2010
(2010), Article ID 592641,
Communications of the IBIMA, 10 pages.
Copyright © 2010 Jose Aldo Diaz-Prado, Arturo Lopez-Pineda and Marco
Polo Cruz-Ramos. This is an open access article distributed under the
Creative Commons Attribution License unported 3.0, which permits
unrestricted use, distribution, and reproduction in any medium, provided
that the original work is properly cited.
Abstract
Investment in technology is a sensitive issue in modern corporate
decision-making processes. Developing software to analyze the trends of
any kind of technology requires extracting the patent information
available on the web, and then cleaning, transforming and loading it
into a standard database. This report describes an Extraction, Cleaning,
Transformation and Loading (ECTL) process in detail, in order to analyze
the overall results and establish a framework for future technology
intelligence research.
Keywords:
Patent analysis, Data extraction, Technology Intelligence.
One of the main goals of businesses around the world is to develop
successful technologies and applications that can be sold in the market
by the same company or by third parties. However, research and
technological development are becoming more competitive on both a global
and a regional scale, which increases the cost of these activities.
Companies that expect to become leaders in their fields are, at the same
time, looking to the patent world. Patents play a very important role in
the technological development of a company because they determine the
useful life of the technology being used, suggest new development areas,
and can provide a strategic advantage over competitors. Software that
can analyze the most important trends in the technology market is
therefore extremely valuable: many companies invest huge amounts of
money in the research and development of new technologies, and the more
they know about their investments, the better they can predict their
success.
Nevertheless, the growth of information systems has led to a significant
increase in the data available for analysis for research and economic
purposes. These data sets normally contain many anomalies that stem from
the way the data were acquired, such as human error or a deficient
storage model. In order to analyze information, establish relationships,
and perform clustering or technology intelligence processes, the
information must be as reliable, complete and accurate as possible, as
mentioned by Oliveira et al (2006).
The anomalies within the information can be identified and eliminated
from the stored data through the process of data cleaning (DC). DC aims
to overcome the problems related to the quality and integrity of the
data. This process requires a framework created by an expert in the data
domain. Since external intervention is needed, DC is also referred to as
a semi-automated process, as proposed by Fisher et al (2008), even
though considerable effort has gone into trying to fully automate it,
such as the work carried out by Simitsis (2003).
The remainder of this document is organized as follows. The Technology
Intelligence section explains the steps needed to create an Information
Technology Research System and why the Extraction, Cleaning,
Transformation and Loading (ECTL) process is important. The ECTL process
section explains the extraction, cleaning, transformation and loading
steps required to build an information system repository. The Patent
Information System section describes how the patent information system
was constructed, detailing its general design and database, and the
Patent Visual Analysis section explains the graphical tool developed to
visualize the patterns. Finally, some conclusions are drawn, together
with a brief description of future work.
Technology Intelligence
In the evolution of the intelligence related to Intellectual Property
(IP), the following steps are identified:
- Patent Search. The process of applying specific criteria to the
overall database of patents in order to select the patents related to a
certain technology.
- Patent Mining. In this step companies study their own patent portfolio
and those of their competitors, with the idea of identifying
relationships between research and the technology market.
- Patent Landscaping. Involves all the previous steps and adds the
Patent Mapping activity, which provides a perspective on the issues that
influence a certain domain of technology, including the leading actors
and their competitors, as well as the speed of development of new
products in the market.
- Technology Intelligence. This is a broader vision than IP; it involves
detecting risks and opportunities in the technological market. It is a
decision-making tool for the research and investment trends that a
certain company should pursue.
Extraction, Cleaning,
Transformation and Loading process
Based on the statement “Knowledge is power”, it can be inferred that
awareness and understanding of the processes a company undergoes allow
it to make better decisions and thus take actions whose outcomes are
aligned with the company's desires and expectations. Yet information is
available in many facets. We are today awash in data, primarily
collected for government and business purposes. Automation produces an
ever-growing flood of data, now feeding such a vast ocean that we can
only watch the swelling tide.
All companies have to decide which actions best suit their interests.
Informed decisions, those made with knowledge of current circumstances
and likely outcomes, are more effective than uninformed decisions. Today
there are three processes for harvesting high-quality data: data
modeling, which reveals the relevant information within the collected
data ocean; data surveying, which looks at the shape of the ocean to
infer where the relevant information might be found; and data
preparation, which cleans the relevant data by removing the dirty data.
It is valuable to provide accurate, timely, and useful information when
addressing a company's strategic problems. That is the main reason why
information is vital. Nevertheless, the value of that information is
proportional to the scale of the problem it addresses. Relevant
information is expensive to collect: it takes time, money, personnel,
effort, skills, and insight into the data to discover proper
information. If the cost of discovery is greater than the value gained,
the effort is not profitable.
The Extraction, Cleaning, Transformation and Loading (ECTL) process is
required to construct an information system repository. Figure 2 shows
the overall process, described by Rahm and Hai Do (2000), where the data
is taken from different sources in order to create a data warehouse. In
this section the steps of the ECTL process are explained in detail.
Preparation of data is not a process that can be carried out blindly. So
far, no fully automatic toolbox has been developed that can be pointed
at a data set to eliminate the dirty data. Probably, when artificial
intelligence techniques become more powerful than they are today, fully
automatic data preparation toolboxes will become feasible. Until then,
data cleaning will remain as much art as science, even with good data
preparation tools.
Fig 2. Extraction, Cleaning, Transformation and Loading Process.
Extraction
The information used in this work was taken from the U.S. Patent Office
(USPTO) website. Since the information is published on the web, the
available format is HTML. Therefore, the data must be extracted into a
more manageable format in order to analyze it and draw inferences.
The steps followed were these. First, a query for the technology to be
analyzed, in this case MEMS, was performed on the USPTO website. The
results of this query were several HTML pages with a total of 14191
patent files shown as links within the website. Second, reading a
specific patent file requires clicking the desired link so that the
gathered information can be stored; this manual process was automated
using a Web Data Extraction technology. Third, Wrappers were used to
detect the main data for each patent on each of the links. This data
included:
- Patent name,
- USP number,
- Inventors' names,
- Assignment name,
- Assignment city,
- Assignment country,
- Application number, and
- Filed date.
Last, an XML file was created containing all the information from each
patent, separating the important fields with specific tags.
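The last extraction step, writing each patent as a tagged XML record, could look roughly like the following Java sketch. The tag names are the ones quoted in the Data Cleaning Rules section, spelled exactly as printed there; the class name, method names and field order are assumptions made for illustration and are not taken from the authors' tool.

```java
import java.util.List;

/** Sketch only: serialize one extracted patent into a tagged XML record. */
final class PatentXmlWriter {

    /** Escape the characters that are significant in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Wrap one value in an opening and a closing tag on its own line. */
    static String element(String tag, String value) {
        return "  <" + tag + ">" + escape(value) + "</" + tag + ">\n";
    }

    /** Build one <patent> record; the inventor tag repeats, the others occur once. */
    static String toXml(String name, String uspNumber, List<String> inventors,
                        String assigneeName, String city, String country,
                        String applicationNumber, String filedDate) {
        StringBuilder xml = new StringBuilder("<patent>\n");
        xml.append(element("patent-name", name));
        xml.append(element("USP-number", uspNumber));
        for (String inventor : inventors) {
            xml.append(element("Inventor-name", inventor));
        }
        // Tag spellings below follow the paper's own listing.
        xml.append(element("assigment-name", assigneeName));
        xml.append(element("assigment-city", city));
        xml.append(element("assigment-country", country));
        xml.append(element("assigment-appl.n", applicationNumber));
        xml.append(element("assigment-filed", filedDate));
        return xml.append("</patent>\n").toString();
    }
}
```

Escaping the field values before writing them keeps characters such as `&` in company names from producing a malformed file that the later parsing step could not read.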
XML is a standard for information exchange and retrieval. An XML
document has a schema that defines the data definition and structure of
the document [9]. An XML structure was selected because of the wide
acceptance of XML. A number of techniques are required to retrieve and
analyze the vast number of XML documents. Automatic deduction of the
structure of XML documents for storing semi-structured data has been an
active subject among researchers. A number of query languages for
retrieving data from various XML data sources have also been developed.
However, the use of these query languages is limited (e.g., limited
types of inputs and outputs, and users must know exactly what kinds of
information are to be accessed). Data mining, on the other hand, allows
the user to search for unknown facts, the information hidden behind the
data. It also enables users to pose more complex queries, as described
by Dunham (2003).
Data Cleaning
The Data Cleaning process proposed by Müller and Freytag (2003) consists
of the following steps to verify the quality of the data:
- Parsing. The data is checked to detect syntax and grammatical errors
in the tuples or in the values of the database. This step follows an
approach similar to that of a language compiler.
- Data Transformation. Requires modifying the data from its original
format to the requested one, affecting the schema of the tuples as well
as their values.
- Integrity Constraint Enforcement. After all the changes are made to
the database, the original constraints on fields and tuples must still
hold in the new database.
- Duplicate Elimination. In this step, all tuples or fields that are
duplicated, and therefore give no additional information, are
eliminated.
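As an illustration of the last step, the Java sketch below removes duplicate tuples by keeping the first tuple seen for each key. It assumes that each tuple is a string array whose first element is the patent number and that two numbers are the same once separators such as commas are stripped; both conventions are hypothetical and are not described in the source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch only: duplicate elimination keyed on a normalized patent number. */
final class DuplicateEliminator {

    /** Drop everything except letters and digits so "7,123,456" equals "7123456". */
    static String normalizeKey(String uspNumber) {
        return uspNumber.replaceAll("[^0-9A-Za-z]", "").toUpperCase(Locale.ROOT);
    }

    /** Keep the first tuple for each key; tuple[0] is assumed to hold the patent number. */
    static List<String[]> removeDuplicates(List<String[]> tuples) {
        Map<String, String[]> unique = new LinkedHashMap<>();
        for (String[] tuple : tuples) {
            unique.putIfAbsent(normalizeKey(tuple[0]), tuple);
        }
        return new ArrayList<>(unique.values());
    }
}
```

A `LinkedHashMap` is used so that the surviving tuples keep their original order, which matters when the output is later written sequentially.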
Anomalies
Anomalies in the data of an information system can be classified,
according to Müller and Freytag (2003), into three groups:
- Syntactic Anomalies. These concern the format and values used to
represent an entity. They include lexical errors, such as a tuple that
does not have the expected size; format errors, such as a missing
separator character; and irregularities, such as the use of different
values to represent the same instance.
- Semantic Anomalies. Anomalies where the real world is not correctly
represented in the database. They include integrity violations, such as
inconsistent schemas in some of the tuples; contradictions, such as
related tuples or fields that state opposite information; duplicates,
when there are two instances of the same element; and invalid tuples,
which show none of the previous anomalies but still do not represent the
real world.
- Coverage Anomalies. These anomalies exist when the stored data has a
missing value in some field, or when tuples that are supposed to be in
the database are missing.
Data Cleaning Rules
When the database structure is established, the next step is to format
the raw data, that is, the data that comes from the extraction process,
in this case from a dynamic context.
Since the original information contains several anomalies, the following
criteria were followed to clean and transform the data stored in the XML
file into MySQL instructions. Java was used to load, parse, transform
and clean the data.
The first rule applied to the raw data was to extract the individual
information for each patent, using the tag <patent> for the start and
</patent> for the end. This step gives a Vector of $N$ individual
patents, to which the following rules were applied.
- Extract the name of the patent using the tag <patent-name> for the
beginning and </patent-name> for the end.
- Extract the USP number using the tag <USP-number> for the beginning
and </USP-number> for the end.
- Extract the information of the inventors using the tag <Inventor-name>
for the beginning and </Inventor-name> for the end, which gives a Vector
of size $I$ with the inventors associated with this patent. If there are
punctuation marks at the beginning of a name, such as commas, these
characters are eliminated.
- Extract the information of the company to which the patent was
assigned, using the tag <assigment-name> for the beginning and
</assigment-name> for the end. If this tag does not exist, the first
name in the inventors list is taken as the company name. If the tag has
a value but its content is numeric, the value is moved to the
application number field. If there are parentheses in the name, or
initial commas, these characters are eliminated.
- Extract the city of the company to which the patent was assigned,
using the tag <assigment-city> for the beginning and </assigment-city>
for the end. If there is no tag, the default value is not available
<n.a.>.
- Extract the country of the company using the tag <assigment-country>
for the beginning and </assigment-country> for the end. If there is no
defined country, the default value is USA.
- Extract the application number of the patent using the tag
<assigment-appl.n> for the beginning and </assigment-appl.n> for the
end.
- Extract the filed date using the tag <assigment-filed> for the
beginning and </assigment-filed> for the end. The date is translated to
a new format in order to comply with the SQL standard yyyy-mm-dd.
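The rules above can be sketched in Java as follows. This is a minimal illustration rather than the authors' code: it uses regular expressions instead of a full XML parser, and the class and method names are hypothetical. Two behaviours are assumptions because the source does not specify them: when a numeric company name is moved to the application number field, the company name falls back to the first inventor, and when there are no inventors either, it falls back to "n.a.".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: tag extraction and the company, city and country rules. */
final class PatentCleaner {

    /** Return the trimmed text between every <tag>...</tag> pair, in document order. */
    static List<String> extractAll(String xml, String tag) {
        Pattern pattern = Pattern.compile(
                Pattern.quote("<" + tag + ">") + "(.*?)" + Pattern.quote("</" + tag + ">"),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(xml);
        List<String> values = new ArrayList<>();
        while (matcher.find()) {
            values.add(matcher.group(1).trim());
        }
        return values;
    }

    /** Return the first occurrence of the tag, or null when the tag is absent. */
    static String extractFirst(String xml, String tag) {
        List<String> values = extractAll(xml, tag);
        return values.isEmpty() ? null : values.get(0);
    }

    /** Inventor rule: drop punctuation marks and spaces at the start of a name. */
    static String stripLeadingPunctuation(String raw) {
        return raw.replaceAll("^[\\p{Punct}\\s]+", "");
    }

    /** Company rule: drop parentheses anywhere and commas at the start. */
    static String cleanCompanyName(String raw) {
        return raw.replaceAll("[()]", "").replaceAll("^[\\s,]+", "").trim();
    }

    /** Returns {company, applicationNumber, city, country} for one <patent> record. */
    static String[] cleanAssignment(String patentXml) {
        List<String> inventors = new ArrayList<>();
        for (String raw : extractAll(patentXml, "Inventor-name")) {
            inventors.add(stripLeadingPunctuation(raw));
        }
        String company = extractFirst(patentXml, "assigment-name");
        String applNumber = extractFirst(patentXml, "assigment-appl.n");
        if (company != null && company.matches("\\d+")) {
            applNumber = company;   // numeric content belongs in the application number
            company = null;         // assumption: then treat the company as missing
        }
        if (company == null) {
            // Missing company: use the first inventor; "n.a." is an assumed last resort.
            company = inventors.isEmpty() ? "n.a." : inventors.get(0);
        }
        String city = extractFirst(patentXml, "assigment-city");
        String country = extractFirst(patentXml, "assigment-country");
        return new String[] {
            cleanCompanyName(company),
            applNumber,
            city == null ? "n.a." : city,
            country == null ? "USA" : country
        };
    }
}
```

Calling `extractAll(xml, "patent")` on the whole file yields the $N$ individual patent records, because the quoted pattern `<patent>` does not match the longer `<patent-name>` tag.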
Transformation
The transformation process was executed at the same time as the data
cleaning process. Because the two are closely related, the rules for
transforming the data were applied simultaneously.
Some of the steps performed to transform the data are:
- Translating values. The date was changed from its original format to
comply with the MySQL syntax.
- Splitting. Some columns were separated into two fields, as with the
inventors of each patent, whose number varies and therefore required a
variable length.
- Validation. Some values did not correspond to the field in which they
appeared, such as a numeric value in the company name.
- Generating. Some values needed for consistency in the database were
generated, for example by assigning a default value for the country.
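The value translation for dates can be sketched with the standard `java.time` API. The source does not state how the filed date is printed on the USPTO pages, so the input pattern used here (an English month name, day and four-digit year, as in "March 5, 2002") is an assumption, and the class name is hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Sketch only: translate a filed date into the yyyy-mm-dd form used by SQL. */
final class DateTranslator {

    // Assumed source format, e.g. "March 5, 2002"; adjust to the real page layout.
    private static final DateTimeFormatter SOURCE_FORMAT =
            DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);

    /** Parse the raw text and print it as an ISO date; throws DateTimeParseException on bad input. */
    static String toSqlDate(String raw) {
        LocalDate date = LocalDate.parse(raw.trim(), SOURCE_FORMAT);
        return DateTimeFormatter.ISO_LOCAL_DATE.format(date);
    }
}
```

For example, `DateTranslator.toSqlDate("March 5, 2002")` returns `2002-03-05`. Letting the parser reject malformed dates also serves as the validation step for this field.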
Loading
When data cleaning and transformation are concluded, the next step is to
generate the MySQL instructions in such a way that they can be stored in
a file and loaded at a later date. This process is done in parallel with
data cleaning, creating the instructions dynamically, patent by patent.
The file is constructed by writing the information for the tables
sequentially in a predefined order, because of the related information,
indexes and specific constraints of the database. For example, to relate
the USP-number with the inventor_ID in the Inventions table, both values
must already exist before they can be loaded together. The tables are
loaded sequentially in the following order: 1) Assignee, 2) Place, 3)
Assignment, 4) Inventors, 5) Inventions.
The keyword IGNORE is added to each INSERT statement so that duplicate
instances are skipped and MySQL errors are avoided. At the end of the
process, an SQL file is created that contains the instructions to
generate the new elements of the repository.
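A minimal Java sketch of this loading step is shown below. The table names and their order are the ones listed above; the builder class, its methods and any column names passed to it are hypothetical. The sketch buffers statements per table and emits them in the predefined order, which is one possible way to respect the dependencies and not necessarily how the original tool wrote its file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: collect INSERT IGNORE statements and emit them in load order. */
final class SqlFileBuilder {

    // Parent tables come first so that referenced rows exist before Inventions.
    static final List<String> LOAD_ORDER =
            Arrays.asList("Assignee", "Place", "Assignment", "Inventors", "Inventions");

    private final Map<String, List<String>> statementsByTable = new LinkedHashMap<>();

    SqlFileBuilder() {
        for (String table : LOAD_ORDER) {
            statementsByTable.put(table, new ArrayList<String>());
        }
    }

    /** Quote a value as an SQL string literal, escaping backslashes and single quotes. */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'";
    }

    /** Queue one row; column names are supplied by the caller and are not validated. */
    void add(String table, List<String> columns, List<String> values) {
        List<String> target = statementsByTable.get(table);
        if (target == null) {
            throw new IllegalArgumentException("Unknown table: " + table);
        }
        if (columns.size() != values.size()) {
            throw new IllegalArgumentException("Column and value counts differ");
        }
        List<String> quoted = new ArrayList<>();
        for (String value : values) {
            quoted.add(quote(value));
        }
        target.add("INSERT IGNORE INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + String.join(", ", quoted) + ");");
    }

    /** Return the file content: one statement per line, tables in LOAD_ORDER. */
    String build() {
        StringBuilder sql = new StringBuilder();
        for (String table : LOAD_ORDER) {
            for (String statement : statementsByTable.get(table)) {
                sql.append(statement).append('\n');
            }
        }
        return sql.toString();
    }
}
```

Every value is written as a quoted string for simplicity, which MySQL converts for numeric columns. When the program talks to the database directly instead of writing a file, parameterized statements are usually a safer choice than manual quoting.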
Patent Information System
In order to develop a unified system of technology intelligence, certain
steps are performed to transform the original data into useful knowledge
that helps in the decision-making process.
The following steps describe this process:
- Identify the type of data required for the analysis and join the
different sources in a single repository.
- Clean and format the data so that relationships can be established
between inventors, companies, places, etc.
- Construct a system that joins all the data and presents the most
important information visually.
- Establish, as the final objective, a model for a system that can be
used in the future to analyze different technologies, not only the one
for which it was originally built.
System Design
Designing an adequate system for managing the corporate information
related to patent research is very important in the technology
intelligence process. The goal is to develop a software tool that lets
users interact with it and perform an Extraction, Cleaning,
Transformation and Loading (ECTL) process. Several technologies coexist
in this software, and it can be used across platforms since it was
developed in Java. It generates views and reports of the historic patent
assignments recorded by the USPTO. Figure 3 depicts the information
system schema.
Fig 3. Information System.

Fig 4. Patent Database.
For the MEMS-related data from the U.S. Patent Office database, the data
structure shown in Figure 4 was established. The figure shows the tables
created for data storage with their fields, primary and foreign keys,
and the relationships among them.
The Places table stores the information related to the location of a
patent; in other words, it stores the city and country of the company or
person to which the patent was assigned. The Assignees table stores the
names of the companies or people to whom a certain patent is assigned.
An identification number is created for each instance for later use in
other tables. The names of the authors who work on this technology are
stored in the Inventors table, where a unique identifier is assigned to
establish future relationships.
The Patents table is the main one, since it connects the patent number
assigned by the U.S. Patent Office with the rest of the tables. It also
stores the name of the patent, its date, its assignment number and the
identifiers from other tables, such as the place.
The Inventions table is an intermediate table designed to store the many
relationships between patents and inventors. This table is the result of
normalizing the original model, in which the relationship between the
Inventors and Patents tables was many-to-many.
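The role of the Inventions table can be illustrated with a small in-memory model in Java. This is only a sketch of the junction-table idea: the field names `uspNumber` and `inventorId` are assumptions based on the USP-number and inventor_ID mentioned in the Loading section, and the real schema in Figure 4 may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: an in-memory stand-in for the Inventions junction table. */
final class InventionsTable {

    /** One row links exactly one patent to exactly one inventor. */
    static final class Row {
        final String uspNumber;
        final int inventorId;

        Row(String uspNumber, int inventorId) {
            this.uspNumber = uspNumber;
            this.inventorId = inventorId;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    /** Add a link unless the same pair already exists; returns true when a row was added. */
    boolean link(String uspNumber, int inventorId) {
        for (Row row : rows) {
            if (row.uspNumber.equals(uspNumber) && row.inventorId == inventorId) {
                return false;
            }
        }
        rows.add(new Row(uspNumber, inventorId));
        return true;
    }

    /** All inventors of one patent (one side of the many-to-many relationship). */
    Set<Integer> inventorsOf(String uspNumber) {
        Set<Integer> result = new LinkedHashSet<>();
        for (Row row : rows) {
            if (row.uspNumber.equals(uspNumber)) {
                result.add(row.inventorId);
            }
        }
        return result;
    }

    /** All patents of one inventor (the other side of the relationship). */
    Set<String> patentsOf(int inventorId) {
        Set<String> result = new LinkedHashSet<>();
        for (Row row : rows) {
            if (row.inventorId == inventorId) {
                result.add(row.uspNumber);
            }
        }
        return result;
    }

    int size() {
        return rows.size();
    }
}
```

Rejecting a repeated pair in `link` plays the same role as a composite key combined with INSERT IGNORE in the database: a patent and an inventor are related at most once.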
Patent Visual Analysis
Patents are a wide field in which techniques, products, applications,
and legal considerations are highly mixed. Most of the time this field
is also dedicated to industrial users; the academic community, for
example, does not cite patents very often. Nevertheless, patents are a
unique source of information, since most of the data and information
published in patents is not published elsewhere. However, using and
managing a set of patents is rather complicated, because most of the
tools available today are expensive and complicated or require strong
expertise in the field of intellectual property. According to Dou
(1997), the cost of patent databases in which users are allowed to
perform complete searches (involving a large number of patents) or to
automatically establish relationships between patents is very high and
most of the time out of the reach of medium-sized enterprises, academic
laboratories, or developing countries. We had the opportunity to use
patents in many circumstances and, from these uses, to develop the basic
knowledge needed to design and develop various tools that integrate
patent data into Competitive Intelligence or Competitive Technical
Intelligence, as well as into innovative thinking, as presented by
Quoniam et al (1993). Furthermore, it is very important to analyze
graphically the potential insights that patents provide. We can analyze
multiple patents in different contexts (Figure 5), such as:
Fig 5. The Corporate Technology Intelligence Research System.
- The number of patents generated around the world during a period of
time for a specific technology. This information is presented as a
top-ten list of countries for that technology, in order to show how
scientific production around the world has been growing or decreasing
(Figure 6).
- The number of patents in MEMS technologies per company during a period
of time. This kind of analysis shows which companies are leading
research and innovation in this technology (Figure 7). At the same time,
it shows where these companies are located and who the scientists
developing the technology are.
- The third kind of patent analysis is per country. Sometimes the
corporation wants to know which country is currently leading in MEMS
technologies. It is important to know the leaders or the followers of
our company and where those leaders are located. The report also
includes an area on the right side of the screen for comments; these
comments are saved, and the report is generated in “.pdf” format.

Fig 6. MEMS Patents Analysis for the last seven years in the USA.
Fig 7. Corporations leading the MEMS Technologies.
- The last kind of analysis that the Corporate Technology Intelligence
Research System can generate is a mixed analysis, in which Tag Clouds
are applied to give a brief overview of the relevant topics in a patent
or in the MEMS technologies of a country (Figure 8).
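The per-country and per-company rankings described in this list reduce to counting patents per key and keeping the largest counts. The Java sketch below shows that computation under the assumption that the input is simply one key (country or company name) per patent; the class name and the tie-breaking rule (alphabetical order) are illustrative choices, not details taken from the system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: count patents per key and return the n most frequent keys. */
final class PatentRanking {

    /** keys holds one entry per patent, e.g. its country; ties stay in alphabetical order. */
    static List<Map.Entry<String, Integer>> topN(List<String> keys, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts keep the TreeMap's alphabetical order.
        ranked.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }
}
```

Calling `topN(countries, 10)` would give the data behind a top-ten list such as the one described for Figure 6, and the same method applied to company names gives the ranking of leading corporations.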
Conclusion and Future Work
A large literature exists on using patents to build various indices of
R&D, to ascertain the quality of inventions, to compare patent
production in various countries, to evaluate the R&D policy of firms
within or outside a country, etc. We propose to extend the process of
patent analysis with the introduction of new technologies such as data
mining (classification, clustering, etc.), business intelligence and
patent market analysis.
At the same time, the Extraction, Cleaning, Transformation and Loading
(ECTL) of the required data is an important topic of analysis because of
its contribution to any given information system. In addition, because
of the large amount of data stored in XML format, an automatic process
is required to identify anomalies and resolve some errors.
Fig 8. Tag Clouds Technology for Patent Analysis.
The Data Cleaning process consists of several steps that improve the
quality of the stored data. Data quality is very important for the
quality of the results obtained from a technology intelligence system,
especially if the system has to manage large amounts of complex
information, which is becoming increasingly common in everyday
applications.
Technology intelligence is becoming the most important trend in the
intellectual property industry, especially for companies interested in
their long-term survival. The ECTL process is the first step in
developing a software tool that helps in the decision-making process; if
it is done correctly, the next steps will have better results.
Furthermore, in the near future, the purpose is to integrate data mining
algorithms with XML documents to achieve knowledge discovery. For
example, after identifying similarities among various XML documents, a
mining technique can analyze links between tags that occur together
within the documents while preserving the structure of each XML
document. Note that in this kind of data mining, knowledge is inferred
from the internal structure of XML documents.
Further research on the Data Mining and Patent Landscaping processes
will be directed toward the goal of creating a Technology Intelligence
research system. This will be important for improving decisions
regarding technology investments.
References
Abiteboul, S., Buneman, P. & Suciu, D. (2000). Data on the Web: From Relations to Semistructured Data and XML, San Francisco, CA: Morgan Kaufmann.
Dou, H. (1997). 'Hearing 97 - Patents in Europe - Usage and Dissemination of Patents as a tool to Improve SME's Strategies,' Hearing, 1997. Proceedings on the Future Patent Politique d'information Brevets de l'Organisation Européenne des brevets (pp.66-68). Munich, Germany: European Patent Office.
Dunham, M. H. (2003). 'Data Mining: Introductory and Advanced Topics,' Upper Saddle River, NJ: Prentice Hall.
Fisher, K., Walker, D., Zhu, K.Q. & White, P. (2008). "From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data," SIGPLAN Not., 43(1):421-434.
Müller, H. & Freytag, J.C. (2003). 'Problems, Methods and Challenges in Comprehensive Data Cleansing,' Technical Report HUB-IB-164, Humboldt-Universität zu Berlin, Institut für Informatik, Berlin.
Oliveira, P., Rodrigues, F. & Henriques, P. (2006). "An Ontology-Based Approach for Data Cleaning," Proceedings of the 11th International Conference on Information Quality, MIT, Boston, USA, November 2006.
Quoniam, L., Hassanaly, P., Baldit, P., Rostaing, H. & Dou, H. (1993). 'Bibliometric Analysis of Patent Documents for R&D Management,' Research Evaluation, (3), 13-18. April 1993.
Rahm, E. & Hai Do, H. (2000). "Data Cleaning: Problems and Current Approaches," IEEE Data Engineering Bulletin, 23:2000, 2000.
Simitsis, A. (2003). "Modeling and Managing ETL Processes," In: VLDB PhD Workshop, 2003.
Yeap, T., Hwa Loo, G. & Pang, S. (2003). "Computational Patent Mapping: Intelligent Agents for Nanotechnology," MEMS, NANO and Smart Systems, 2003. Proceedings. International Conference on, pages 274-278, July 2003.