Common-sense reasoning has recently emerged as an important test for artificial general intelligence, especially given the much-publicized successes of language representation models such as T5, BERT, and GPT-3. Currently, typical benchmarks involve question-answering tasks, but to test the full complexity of common-sense reasoning, more comprehensive evaluation methods that are grounded in theory should be developed.
Knowledge Graph Construction from Data, Data Dictionaries, and Codebooks: the National Health and Nutrition Examination Surveys Use Case
Santos, Henrique, Pinheiro, Paulo, and McGuinness, Deborah L.
CDC’s National Health and Nutrition Examination Surveys (NHANES) is a continuous survey that aims to study the relationship between diet, nutrition, and health and their roles in designated population subgroups with selected diseases and risk factors. Data are acquired using questionnaires (administered either by human interviewers or with computer assistance) that collect information about participants’ households and families, medical conditions, substance usage, and more. NHANES data and supporting documentation, including data dictionaries (DDs) and codebooks (CBs), are made publicly available and are used in many data science efforts to support a wide range of health informatics projects. A typical use of NHANES data requires complex human interpretation of the data with the help of the DDs and CBs. For example, to retrieve “diseases treated by a specific drug in households with annual income under $20,000”, one must select all the relevant variables (diseases, drugs, household income, participants) across the relevant datasets (demographic, drug usage) and perform a series of transformations (normalizing disease and income codes) to generate the answer for the query. During data processing, it is not uncommon for data to be misinterpreted, as NHANES may use the same variable for multiple purposes (e.g., the same variable is used for diseases being treated and diseases being prevented by a drug, and sometimes this distinction is critical to applications). Furthermore, the results of this processing may be incorrectly combined (e.g., harmonized with new data from NHANES or other studies).
We present our approach for translating NHANES’ datasets, metadata, and any additional documentation from the surveys into a rich knowledge graph (KG) that maintains semantic distinctions. We leverage the Human-Aware Data Acquisition Infrastructure (HADatAc) and its underlying Human-Aware Science Ontology (HAScO) to systematically represent the complete data acquisition process. Semantic Data Dictionaries (SDDs), which are derived from DDs and CBs, support the elicitation of objects that are not directly represented within NHANES datasets (including households, household reference persons, drug usage for disease treatment, drug usage for disease prevention, etc.). We demonstrate how we use the KG to generate tailored datasets based on user choice of variables and alignment criteria across multiple NHANES datasets. Our use of SDDs enables the combined use of ontologies and data. We further demonstrate that, once data is encoded into the KG, the KG can support complex automated data harmonization that, until now, has been done manually whenever any kind of meta-analysis study based on NHANES requires it.
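To make the income-and-drug query above concrete, the sketch below runs it against a toy graph. It is a minimal illustration assuming a hypothetical vocabulary (the ex: terms are invented and are not the authors’ actual ontology); note how modeling treatment and prevention as distinct properties prevents the misinterpretation described above.

    # A minimal sketch, assuming a hypothetical ex: vocabulary (not the
    # authors' actual ontology). Illustrates the example query "diseases
    # treated by a specific drug in households with annual income under
    # $20,000" once the data lives in a knowledge graph.
    from rdflib import Graph

    TTL = """
    @prefix ex: <http://example.org/nhanes#> .

    ex:hh1 a ex:Household ; ex:annualIncome 18000 ; ex:hasMember ex:p1 .
    ex:hh2 a ex:Household ; ex:annualIncome 55000 ; ex:hasMember ex:p2 .
    ex:p1 ex:usesDrug ex:du1 .
    ex:du1 ex:drugName "metformin" ; ex:treatsDisease "diabetes" .
    ex:p2 ex:usesDrug ex:du2 .
    ex:du2 ex:drugName "metformin" ; ex:preventsDisease "diabetes" .
    """

    QUERY = """
    PREFIX ex: <http://example.org/nhanes#>
    SELECT DISTINCT ?disease WHERE {
      ?hh a ex:Household ; ex:annualIncome ?income ; ex:hasMember ?p .
      ?p ex:usesDrug ?use .
      ?use ex:drugName "metformin" ;
           ex:treatsDisease ?disease .   # distinct from ex:preventsDisease
      FILTER (?income < 20000)
    }
    """

    g = Graph()
    g.parse(data=TTL, format="turtle")
    for row in g.query(QUERY):
        print(row.disease)   # prints only the *treated* disease, "diabetes"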
2021
Exploring and Analyzing Machine Commonsense Benchmarks
Santos, Henrique, Gordon, Minor, Liang, Zhicheng, Forbush, Gretchen, and McGuinness, Deborah L.
In Proceedings of the Workshop on Common Sense Knowledge Graphs Sep 2021
Commonsense question-answering (QA) tasks, in the form of benchmarks, are constantly being introduced for challenging and comparing commonsense QA systems. The benchmarks provide question sets that system developers can use to train and test new models before submitting their implementations to official leaderboards. Although these tasks are created to evaluate systems along identified dimensions (e.g., topic, reasoning type), this metadata is limited, largely presented in an unstructured format, or absent altogether. Because machine common sense (MCS) is a fast-paced field, the problem of fully assessing current benchmarks and systems with regard to these evaluation dimensions is aggravated. We argue that the lack of a common vocabulary for aligning these approaches’ metadata limits researchers both in understanding systems’ deficiencies and in making effective choices for future tasks. In this paper, we first discuss the MCS ecosystem in terms of its elements and their metadata. Then, we present how we are supporting the assessment of approaches by initially focusing on commonsense benchmarks. We describe our initial MCS Benchmark Ontology, an extensible common vocabulary that formalizes benchmark metadata, and showcase how it supports the development of a tool that enables benchmark exploration and analysis.
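As an illustration of what formalized benchmark metadata buys, the sketch below describes two benchmarks in RDF and queries them by reasoning type. The class and property names are hypothetical stand-ins, not the published ontology’s terms, and the metadata assigned to the benchmarks is invented.

    # Hypothetical mcs: vocabulary, invented for illustration; not the
    # actual MCS Benchmark Ontology terms.
    from rdflib import Graph

    TTL = """
    @prefix mcs: <http://example.org/mcs#> .

    mcs:BenchmarkA a mcs:Benchmark ;
        mcs:taskFormat mcs:MultipleChoiceQA ;
        mcs:reasoningType mcs:TemporalReasoning , mcs:CausalReasoning .

    mcs:BenchmarkB a mcs:Benchmark ;
        mcs:taskFormat mcs:MultipleChoiceQA ;
        mcs:reasoningType mcs:PhysicalReasoning .
    """

    g = Graph()
    g.parse(data=TTL, format="turtle")

    # With metadata in a shared vocabulary, "which benchmarks exercise
    # temporal reasoning?" is a query, not a literature survey.
    q = """
    PREFIX mcs: <http://example.org/mcs#>
    SELECT ?b WHERE { ?b a mcs:Benchmark ; mcs:reasoningType mcs:TemporalReasoning . }
    """
    for row in g.query(q):
        print(row.b)   # -> http://example.org/mcs#BenchmarkA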
An experimental study measuring human annotator categorization agreement on commonsense sentences
Santos, Henrique, Kejriwal, Mayank, Mulvehill, Alice M., Forbush, Gretchen, and McGuinness, Deborah L.
Developing agents capable of commonsense reasoning is an important goal in Artificial Intelligence (AI) research. Because common sense is broadly defined, a computational theory that can formally categorize the various kinds of commonsense knowledge is critical for enabling fundamental research in this area. In a recent book, Gordon and Hobbs described such a categorization, argued to be reasonably complete. However, the theory’s reliability has not been independently evaluated through human annotator judgments. This paper describes such an experimental study, whereby annotations were elicited across a subset of eight foundational categories proposed in the original Gordon-Hobbs theory. We avoid bias by eliciting annotations on 200 sentences from a commonsense benchmark dataset independently developed by an external organization. The results show that, while humans agree on relatively concrete categories like time and space, they disagree on more abstract concepts. The implications of these findings are briefly discussed.
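One standard way to quantify this kind of categorization agreement is Fleiss’ kappa over per-item category counts. The sketch below is a generic illustration of that statistic; the example ratings are invented, and the paper’s own data and choice of agreement measure may differ.

    # A minimal sketch of Fleiss' kappa; the example items are made up.

    def fleiss_kappa(counts):
        """counts[i][j] = number of annotators who put item i in category j."""
        n_items = len(counts)
        n_raters = sum(counts[0])                    # raters per item (constant)
        # Mean per-item observed agreement.
        p_bar = sum(
            (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
            for row in counts
        ) / n_items
        # Chance agreement from the marginal category proportions.
        totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
        p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
        return (p_bar - p_e) / (1 - p_e)

    # Columns could stand for categories like "time", "space", "world states".
    ratings = [
        [5, 0, 0],   # all 5 annotators agree -> a concrete category
        [4, 1, 0],
        [2, 2, 1],   # split decision -> a more abstract category
    ]
    print(round(fleiss_kappa(ratings), 3))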
Geospatial Reasoning with Shapefiles for Supporting Policy Decisions
Santos, Henrique, McCusker, James P., and McGuinness, Deborah L.
In Proceedings of the 4th International Workshop on Geospatial Linked Data (GeoLD 2021) Sep 2021
Policies are authoritative assets that are present in multiple domains to support decision-making. They describe what actions are allowed or recommended when domain entities and their attributes satisfy certain criteria. It is common to find policies that contain geographical rules, including distance and containment relationships among named locations. These locations’ polygons can often be found encoded in geospatial datasets. We present an approach to transform data from geospatial datasets into Linked Data using the OWL, PROV-O, and GeoSPARQL standards, and to leverage this representation to support automated ontology-based policy decisions. We applied our approach to location-sensitive radio spectrum policies to identify relationships between radio transmitters’ coordinates and policy-regulated regions in Census.gov datasets. Using a policy evaluation pipeline that mixes OWL reasoning and GeoSPARQL, our approach computes the relevant geospatial relationships according to a set of requirements elicited from radio spectrum domain experts.
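The geometric core of such a policy check is a containment test between a transmitter’s coordinates and a region polygon. The sketch below uses shapely in place of the paper’s GeoSPARQL pipeline, with an invented region and coordinates, purely to make the containment relationship concrete.

    # A minimal sketch of the geometric core of a location-sensitive policy
    # check. Region and coordinates are invented; the authors' pipeline
    # additionally runs OWL reasoning and GeoSPARQL.
    from shapely.geometry import Point, Polygon

    # Polygon vertices as (longitude, latitude), as a shapefile would provide.
    protected_region = Polygon([
        (-73.70, 42.70), (-73.60, 42.70), (-73.60, 42.78), (-73.70, 42.78),
    ])

    transmitter = Point(-73.65, 42.74)

    # Containment drives the policy decision: transmitters inside the region
    # are subject to its rules (e.g., denied or power-limited).
    if protected_region.contains(transmitter):
        print("inside regulated region -> apply policy restrictions")
    else:
        print("outside regulated region -> policy does not apply")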
Towards a Domain-Agnostic Computable Policy Tool
Falkow, Mitchell, Santos, Henrique, and McGuinness, Deborah L.
Policies are often crucial for decision-making in a wide range of domains. Typically, they are written in natural language, which leaves room for differing individual interpretations. In contrast, computable policies offer standardization for the structures that encode information, which can help decrease ambiguity and variability of interpretations. Unfortunately, the majority of computable policy frameworks are domain-specific or require tailored customization, limiting potential applications of this technology. For this reason, we propose ADAPT, a domain-agnostic policy tool that leverages domain knowledge expressed in knowledge graphs and employs W3C standards for semantics and provenance to enable the construction, visualization, and management of computable policies. Incorporating domain knowledge reduces terminology inconsistencies and augments the policy evaluation process.
2020
The Semantic Data Dictionary – An Approach for Describing and Annotating Data
Rashid, Sabbir M., McCusker, James P., Pinheiro, Paulo, Bax, Marcello P., Santos, Henrique, Stingone, Jeanette A., Das, Amar K., and McGuinness, Deborah L.
It is common practice for data providers to include text descriptions for each column when publishing data sets in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a data set, existing data dictionaries are typically not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation to data, enabling standardization and harmonization across diverse data sets. In this paper, we present our Semantic Data Dictionary work in the context of biomedical data; however, the approach can be, and has been, used in a wide range of domains. Rendering data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.
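The sketch below illustrates the core idea of such a column annotation: each column is mapped to what it measures, on which entity, and in which unit. The sdd: terms are hypothetical stand-ins rather than the specification’s actual vocabulary; BPXSY1 is an NHANES-style blood pressure variable used only as an example.

    # Illustrative sketch of a Semantic Data Dictionary-style mapping.
    # The sdd: property names are hypothetical, not the SDD specification's
    # actual vocabulary.
    from rdflib import Graph

    TTL = """
    @prefix sdd: <http://example.org/sdd#> .

    # Mapping for an NHANES-style column BPXSY1 (systolic blood pressure).
    sdd:col_BPXSY1
        sdd:column     "BPXSY1" ;
        sdd:attribute  sdd:SystolicBloodPressure ;   # what is measured
        sdd:entity     sdd:StudyParticipant ;        # whom it is measured on
        sdd:unit       sdd:MillimeterOfMercury .     # mmHg
    """

    g = Graph()
    g.parse(data=TTL, format="turtle")

    # Once columns carry machine-readable mappings, "find all columns that
    # measure systolic blood pressure, in any study" becomes a query rather
    # than a manual reading of data dictionaries.
    q = """
    PREFIX sdd: <http://example.org/sdd#>
    SELECT ?col WHERE { ?m sdd:column ?col ; sdd:attribute sdd:SystolicBloodPressure . }
    """
    for row in g.query(q):
        print(row.col)   # -> BPXSY1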
A Semantic Framework for Enabling Radio Spectrum Policy Management and Evaluation
Santos, Henrique, Mulvehill, Alice, Erickson, John S., McCusker, James P., Gordon, Minor, Xie, Owen, Stouffer, Samuel, Capraro, Gerard, Pidwerbetsky, Alex, Burgess, John, Berlinsky, Allan, Turck, Kurt, Ashdown, Jonathan, and McGuinness, Deborah L.
Because radio spectrum is a finite resource, its usage and sharing are regulated by government agencies. These agencies define policies to manage spectrum allocation and assignment across multiple organizations, systems, and devices. With more portions of the radio spectrum being licensed for commercial use, providing an increased level of automation when evaluating such policies becomes crucial for the efficiency and efficacy of spectrum management. We introduce our Dynamic Spectrum Access (DSA) Policy Framework for supporting the United States government’s mission to enable both federal and non-federal entities to compatibly utilize available spectrum. The DSA Policy Framework acts as a machine-readable policy repository providing policy management features and spectrum access request evaluation. The framework utilizes a novel policy representation using OWL and PROV-O, along with a domain-specific reasoning implementation that mixes GeoSPARQL, OWL reasoning, and knowledge graph traversal, to evaluate incoming spectrum access requests and explain how applicable policies were used. The framework is currently being used to support live, over-the-air field exercises involving a diverse set of federal and commercial radios, as a component of a prototype spectrum management system.
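The sketch below reduces “machine-readable policy repository plus request evaluation” to plain Python, with invented policies, to show the shape of the computation. The actual framework represents policies in OWL and PROV-O and explains decisions through reasoning; this toy version only gestures at explanation by returning the applicable policy names.

    # A deliberately simplified, hypothetical policy store and evaluator;
    # not the framework's actual OWL/PROV-O representation.
    from dataclasses import dataclass

    @dataclass
    class SpectrumRequest:
        freq_mhz: float
        region: str

    @dataclass
    class Policy:
        name: str
        low_mhz: float
        high_mhz: float
        region: str
        effect: str          # "permit" or "deny"

    POLICIES = [
        Policy("radar-protection", 3100.0, 3450.0, "zone-A", "deny"),
        Policy("shared-commercial", 3450.0, 3550.0, "zone-A", "permit"),
    ]

    def evaluate(req: SpectrumRequest):
        """Return the decision plus the applicable policy names (for explanation)."""
        hits = [p for p in POLICIES
                if p.region == req.region and p.low_mhz <= req.freq_mhz < p.high_mhz]
        if not hits:
            return "no-applicable-policy", []
        # "deny" wins over "permit" when overlapping policies conflict.
        effect = "deny" if any(p.effect == "deny" for p in hits) else "permit"
        return effect, [p.name for p in hits]

    print(evaluate(SpectrumRequest(freq_mhz=3500.0, region="zone-A")))
    # -> ('permit', ['shared-commercial'])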
The Dynamic Spectrum Access Policy Framework in Action
Santos, Henrique, Mulvehill, Alice M., Erickson, John S., McCusker, James P., Gordon, Minor, Xie, Owen, Stouffer, Samuel, Capraro, Gerard, Pidwerbetsky, Alex, Burgess, John, Berlinsky, Allan, Turck, Kurt, Ashdown, Jonathan, and McGuinness, Deborah L.
In ISWC 2020 Posters, Demos, and Industry Tracks Apr 2020
Because radio spectrum is a finite resource, its usage and sharing are regulated by government agencies through policies that manage spectrum allocation. With more portions of the spectrum being licensed for commercial use, providing an increased level of automation when evaluating such policies becomes crucial for the efficiency and efficacy of spectrum management. This poster showcases the Dynamic Spectrum Access Policy Framework, which acts as a machine-readable policy repository providing policy management features and spectrum access request evaluation. It includes the use of the framework’s policy management capabilities to create and modify policies in a novel policy representation using two recommended web standards (OWL and PROV-O), and the request evaluation engine to verify the assignment of permit/deny effects to spectrum requests.
2018
Annotating Diverse Scientific Data with HAScO
Pinheiro, Paulo, Bax, Marcello, Santos, Henrique, Rashid, Sabbir M., Liang, Zhicheng, Liu, Yue, McCusker, James P., and McGuinness, Deborah L.
In Proceedings of the Seminar on Ontology Research in Brazil 2018 (ONTOBRAS 2018). São Paulo, SP, Brazil Apr 2018
Ontologies are being widely used across many scientific fields, most notably in roles related to acquiring, preparing, integrating, and managing data resources. Data acquisition and preparation activities are often difficult to reuse since they tend to be domain dependent, as well as dependent on how data is acquired: through measurement, subject-elicitation, and/or model-generation activities. Therefore, tools developed for preparing data from one scientific activity often cannot be easily adapted to prepare data from other scientific activities. We introduce the Human-Aware Science Ontology (HAScO), which integrates a collection of well-established science-related ontologies and aims to address issues related to data annotation for large data ecosystems, where data can come from diverse data sources including sensors, lab results, and questionnaires. The work reported in this paper is based on our experience developing HAScO and using it to annotate data collections to facilitate data exploration and analysis for numerous scientific projects, three of which will be described. Data files produced by scientific studies are processed to identify and annotate the objects (a gene, for instance) with the appropriate ontological terms. One benefit we realized of preserving scientific data provenance is that software platforms can support scientists in their exploration and preparation of data for analysis, since the meaning of and interrelationships between the data are explicit.
HADatAc: A Framework for Scientific Data Integration using Ontologies
Pinheiro, Paulo, Santos, Henrique, Liang, Zhicheng, Liu, Yue, Rashid, Sabbir M., McGuinness, Deborah L., and Bax, Marcello P.
In Proceedings of the ISWC Posters & Demonstrations Track Apr 2018
To investigate the cause and progression of a phenomenon, such as chronic disease, it is essential to collect a wide variety of data that together explain the complex interplay of different factors, e.g., genetic, lifestyle, environmental, and social. Sharing information between studies is therefore of paramount importance. However, data that needs to be analyzed must be appropriately integrated, conceptually aligned, and harmonized. This implies that data collection must be done either in a sufficiently similar or a sufficiently transparent way in order to support meaningful synthesis across different studies. We will demonstrate how the Human-Aware Data Acquisition (HADatAc) framework integrates and harmonizes data from multiple scientific studies, and thus how to use it in interdisciplinary science investigations.
2017
From Data to City Indicators: A Knowledge Graph for Supporting Automatic Generation of Dashboards
Santos, Henrique, Dantas, Victor, Furtado, Vasco, Pinheiro, Paulo, and McGuinness, Deborah L.
In the context of Smart Cities, indicator definitions have been used to calculate values that enable the comparison of different cities. Calculating indicator values is challenging, as the calculation may need to combine aspects of data quality while addressing different levels of abstraction. Knowledge graphs (KGs) have been used successfully to provide flexible representations that can improve understanding and data analysis in similar settings. This paper presents an operational description for a city KG, an indicator ontology that supports indicator discovery and data visualization, and an application capable of performing metadata analysis to automatically build and display dashboards according to discovered indicators. We describe our implementation in an urban mobility setting.
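The sketch below illustrates the metadata-analysis step: indicators in the KG carry enough metadata (theme, unit, temporal granularity) for an application to discover them and choose a visualization automatically. The ind: vocabulary and the indicators themselves are invented for illustration, not the paper’s ontology.

    # Hypothetical ind: vocabulary, invented for illustration.
    from rdflib import Graph

    TTL = """
    @prefix ind: <http://example.org/indicator#> .

    ind:BusAvgSpeed a ind:Indicator ;
        ind:theme "urban mobility" ;
        ind:unit "km/h" ;
        ind:granularity "hourly" .

    ind:TripCount a ind:Indicator ;
        ind:theme "urban mobility" ;
        ind:unit "trips" ;
        ind:granularity "daily" .
    """

    g = Graph()
    g.parse(data=TTL, format="turtle")

    q = """
    PREFIX ind: <http://example.org/indicator#>
    SELECT ?i ?unit ?gran WHERE {
      ?i a ind:Indicator ; ind:theme "urban mobility" ;
         ind:unit ?unit ; ind:granularity ?gran .
    }
    """
    for row in g.query(q):
        # Metadata, not hand-written configuration, decides how to plot.
        chart = "line chart" if str(row.gran) == "hourly" else "bar chart"
        print(f"{row.i}: {chart} in {row.unit}")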
Enabling Data Analytics from Knowledge Graphs
Santos, Henrique
In Proceedings of the Doctoral Consortium at the 16th International Semantic Web Conference (ISWC 2017) May 2017
Scientific data is being acquired in high volumes in support of studies in many knowledge areas. Regular data analytics processes make use of datasets that often lack enough knowledge to facilitate the work of data scientists. By relying on knowledge graphs (KGs), those difficulties can be mitigated. This research focuses on enabling data analytics over scientific data in light of the knowledge available in KGs, giving data users query-based access to scientific data points in KGs while leveraging the available knowledge to facilitate their data analytics activities.
2015
Contextual Data Collection for Smart Cities
Santos, Henrique, Furtado, Vasco, Pinheiro, Paulo, and McGuinness, Deborah L.
In Proceedings of the Sixth Workshop on Semantics for Smarter Cities Oct 2015
As part of Smart Cities initiatives, national, regional, and local governments all over the globe are under a mandate to be more open regarding how they share their data. Under this mandate, many of these governments are publishing data under the umbrella of open government data, which includes measurement data from city-wide sensor networks. Furthermore, many of these data are published in so-called data portals as spreadsheets, comma-separated value (CSV) files, or PDF or Word documents. Sharing these documents may be a convenient way for the data provider to convey and publish data, but it is not the ideal way for data consumers to reuse it. The problems of reusing the data range from the difficulty of opening a document provided in a format other than plain text to the problem of understanding the meaning of each piece of knowledge inside the document. Our proposal tackles those challenges by identifying metadata that has been regarded as relevant for measurement data and providing a schema for this metadata. We further leverage the Human-Aware Sensor Network Ontology (HASNetO) to build an architecture for data collected in urban environments. We discuss the use of HASNetO and the supporting infrastructure to manage both data and metadata in support of the City of Fortaleza, a large metropolitan area in Brazil.
Human-Aware Sensor Network Ontology: Semantic Support for Empirical Data Collection
Pinheiro, Paulo, McGuinness, Deborah L., and Santos, Henrique
In Proceedings of the 5th Workshop on Linked Science. Bethlehem, PA, USA Oct 2015
Significant efforts have been made to understand and document knowledge related to scientific measurements. Many of those efforts resulted in one or more high-quality ontologies that describe some aspects of scientific measurements, but not in a comprehensive and coherently integrated manner. For instance, we note that many of these high-quality ontologies are not properly aligned, and, more challenging, that they have different and often conflicting concepts and approaches for encoding knowledge about empirical measurements. As a result of this lack of an integrated view, it is often challenging for scientists to determine whether any two scientific measurements were taken in semantically compatible manners, thus making it difficult to decide whether measurements should be analyzed in combination or not. In this paper, we present the Human-Aware Sensor Network Ontology (HASNetO), a comprehensive alignment and integration of a sensing infrastructure ontology and a provenance ontology. HASNetO has been under development for more than one year and has been reviewed, shared, and used by multiple scientific communities. The ontology has been in use to support the data management of a number of large-scale ecological monitoring activities (observations) and empirical experiments.
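The sketch below shows the flavor of a measurement annotated with both sensing and provenance context. The prov: terms are the actual PROV-O vocabulary; the ex: terms stand in for HASNetO-style sensing terms and are invented here.

    # Illustrative measurement annotation; ex: terms are hypothetical,
    # prov: terms are real PROV-O.
    from rdflib import Graph

    TTL = """
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix ex:   <http://example.org/hasneto#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    ex:m42 a ex:Measurement ;
        ex:characteristic ex:WaterTemperature ;
        ex:value "12.7"^^xsd:decimal ;
        ex:unit ex:DegreeCelsius ;
        prov:wasAttributedTo ex:sensor_7 ;          # which instrument
        prov:generatedAtTime "2015-06-01T10:00:00Z"^^xsd:dateTime .

    ex:sensor_7 a prov:Agent ;
        ex:deployedBy ex:technician_3 .             # the human-aware part
    """

    g = Graph()
    g.parse(data=TTL, format="turtle")

    # With deployment provenance attached, "were these two measurements taken
    # in semantically compatible manners?" becomes answerable by query.
    print(len(g), "triples describing one measurement and its provenance")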
Semantic Support for Complex Ecosystem Research Environments
As ecosystems come under increasing stresses from diverse sources, there is growing interest in research efforts aimed at monitoring, modeling, and improving understanding of ecosystems and protection options. We aimed to provide a semantic infrastructure capable of representing data initially related to one large aquatic ecosystem research effort, the Jefferson Project at Lake George. This effort includes significant historical observational data, extensive sensor-based monitoring data, experimental data, as well as model and simulation data covering topics including lake circulation, watershed runoff, lake biome food webs, etc. The initial measurement representation has been centered on monitoring data and related provenance. We developed a human-aware sensor network ontology (HASNetO) that leverages existing ontologies (PROV-O, OBOE, VSTO*) in support of measurement annotations. We explicitly support the human-aware aspects of human sensor deployment and collection activity to help capture key provenance that is often lacking. Our foundational ontology has since been generalized into a family of ontologies and used to create our human-aware data collection infrastructure, which now supports the integration of measurement data along with simulation data. Interestingly, we have also utilized the same infrastructure to work with partners who have more specific needs for specifying the environmental conditions under which measurements occur, for example, knowing that an air temperature reading is not an external air temperature but rather the air temperature when windows are shut and curtains are open. We have also leveraged the same infrastructure to work with partners more interested in modeling smart cities, with data feeds more related to people, mobility, environment, and living. We will introduce our human-aware data collection infrastructure and demonstrate how it uses HASNetO and its supporting SOLR-based search platform to support data integration and semantic browsing. Further, we will present lessons learned from its use in three relatively diverse large ecosystem research efforts and highlight some benefits and challenges related to our semantically enhanced foundation.
2012
D2RCrime: A Tool for Helping to Publish Crime Reports on the Web from Relational Data
Tavares, Júlio, Santos, Henrique, Furtado, Vasco, and Vasconcelos, Eurico
In the Law Enforcement context, more and more data about crime occurrences are becoming available to the general public. For effective use of open data, it is desirable that the different sources of information follow a standard that allows reliable comparisons. In addition, the task of creating a correspondence between the standard and the internal representations of each source of information should not impose a steep learning curve. These two conditions are rarely met today, where open data about crime occurrences amounts to the data disclosed by each police department in its own way. This paper proposes an interactive tool, called D2RCrime, that assists the designer/DBA of relational crime databases in making the correspondence between the relational data and the classes and properties of a crime ontology. The ontology plays the role of a standard for representing the concepts of crime and crime report, and is also the interface for publishing relational crime data on the fly. This correspondence allows the automatic generation of mapping rules between the two representations, which in turn allows relational data to be accessed via SPARQL. An evaluation of D2RCrime was conducted with DBAs/system analysts who used the tool to establish correspondences between relational data and the ontology.
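The sketch below illustrates the kind of correspondence the tool helps produce: table columns mapped to ontology properties so that relational rows become SPARQL-queryable RDF. The table layout and the crime-ontology terms are invented for illustration; D2RCrime itself generates declarative mapping rules rather than the hand-rolled conversion shown here.

    # Hypothetical column-to-property mapping and toy relational rows.
    from rdflib import Graph, Literal, Namespace, URIRef

    CRIME = Namespace("http://example.org/crime-ontology#")

    # Column -> ontology property, as elicited from the DBA through the tool.
    COLUMN_MAP = {"offense": CRIME.offenseType, "date": CRIME.reportedOn}

    # Rows as they might come from a police department's relational database.
    rows = [
        {"id": 101, "offense": "robbery", "date": "2012-03-14"},
        {"id": 102, "offense": "theft",   "date": "2012-03-15"},
    ]

    g = Graph()
    for r in rows:
        report = URIRef(f"http://example.org/report/{r['id']}")
        for col, prop in COLUMN_MAP.items():
            g.add((report, prop, Literal(r[col])))

    # The same data, published in a shared vocabulary, is now SPARQL-accessible.
    q = "SELECT ?r WHERE { ?r <http://example.org/crime-ontology#offenseType> 'robbery' }"
    for row in g.query(q):
        print(row.r)   # -> http://example.org/report/101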
A Service-Oriented Architecture for Assisting the Authoring of Semantic Crowd Maps
Santos, Henrique, and Furtado, Vasco
In Advances in Artificial Intelligence - SBIA 2012 Oct 2012
Although there are ever more initiatives for generating semantic knowledge from user participation, there is still a shortage of platforms that let regular users create applications in which semantic data is exploited and generated automatically. We propose an architecture, called Semantic Maps (SeMaps), for assisting the authoring and hosting of applications that combine a Geographic Information System with crowd-generated content (here called crowd maps). In these systems, the digital map works as a blackboard for accommodating stories told by people about events they want to share with others, typically participants in their social networks. SeMaps offers an environment for the creation and maintenance of sites based on crowd maps, allowing the user to semantically characterize what he/she intends to mark on the map. By providing a linguistic expression that designates what is to be marked on the map, the designer of a crowd map is guided through a process that associates a concept from a common-sense knowledge base with this expression. Thus, crowd maps gain access to the common-sense inferential relations that define the meaning of the marker and are able to make inferences over the network of linked data. This makes it possible to generate maps that have the power to perform inferences and access external sources (such as DBpedia) providing information that is useful and appropriate to the map’s context. In this paper we describe the architecture of SeMaps and how it was applied in a crowd map authoring tool.
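As a sketch of the “access external sources” step, the snippet below pulls context for a marker concept from DBpedia. It assumes the marker expression has already been associated with dbr:Robbery (an arbitrary example) and requires network access to the public DBpedia endpoint.

    # Illustrative DBpedia lookup for an already-resolved marker concept.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbr: <http://dbpedia.org/resource/>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {
            dbr:Robbery dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for b in results["results"]["bindings"]:
        # Context the map can display alongside user-created "robbery" markers.
        print(b["abstract"]["value"][:120], "...")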
Open Government and Citizen Participation in Law Enforcement via Crowd Mapping
Furtado, Vasco, Caminha, Carlos, Ayres, Leonardo, and Santos, Henrique
The authors describe WikiCrimes, a project founded on a website of the same name that aims to offer a common interaction space for the general public where they can note criminal activity and track the locations where such crimes occur. The goal is to encourage collaborative participation that generates useful information for everyone. The authors describe their experience with WikiCrimes, emphasizing the services it provides to citizens and underlining some of the problems they’ve overcome to date. Their ultimate goal is to help establish a research agenda about the topic by inviting the academic community to embrace WikiCrimes as a platform that bridges citizen participation and open government.
2011
Widgets baseados em conhecimento advindo de dados referenciados e abertos na Web [Widgets based on knowledge derived from linked and open data on the Web]
Widgets are a very popular way to customize a website: by creating widgets, the content creator configures the website with functions he/she considers adequate for its users. Typically, widgets rely on syndication (RSS), in which one website’s content is made available to other websites. Despite the huge popularity of this kind of widget, they are typically constrained to whatever a data provider makes available. Linked Open Data (LOD) is an opportunity to overcome this constraint in the process of widget creation and execution. We propose a platform, called SemWidgets (from Semantic Widgets), for the creation and execution of programs able to access and reason over LOD. With SemWidgets, we give the content producer of a website a way to describe the concept(s) that best represent the content to be explored. Widgets created with SemWidgets have the power to perform inferences and access external sources providing information that may be useful and appropriate to the website’s context.