[SOLVED] Help with xml, xpath, namespaces.
cindyharper
Member Posts: 9 Contributor II
Below is sample XML from the Google CSE API:
<?xml version="1.0" encoding="UTF-8"?>
<feed gd:kind="customsearch#search" xmlns="http://www.w3.org/2005/Atom" xmlns:cse="http://schemas.google.com/cseapi/2010" xmlns:gd="http://schemas.google.com/g/2005" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<title>Google Custom Search - Albertus Magnus College. library Albertus Magnus College Library intitle:newsletter albertus.edu</title>
<id>tag:www.googleapis.com,2010-09-29:/customsearch/v1?q= Albertus Magnus College. library Albertus Magnus College Library intitle:newsletter albertus.edu&cx=008033228147187897025:-ua_scxr1uc&num=7&start=1&safe=off</id>
<author>
<name>Library Website Search Engine - Google Custom Search</name>
</author>
<updated>1970-01-16T11:10:30.455Z</updated>
<opensearch:Url type="application/atom+xml" template="https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={cse:safe?}&cx={cse:cx?}&cref={cse:cref?}&sort={cse:sort?}&filter={cse:filter?}&gl={cse:gl?}&cr={cse:cr?}}&googlehost={cse:googleHost?}&c2coff={?cse:disableCnTwTranslation}&hq={cse:hq?}&hl={cse:hl?}&siteSearch={cse:siteSearch?}&siteSearchFilter={cse:siteSearchFilter?}&exactTerms={cse:exactTerms?}&excludeTerms={cse:excludeTerms?}&linkSite={cse:linkSite?}&orTerms={cse:orTerms?}&relatedSite={cse:relatedSite?}&dateRestrict={cse:dateRestrict?}&lowRange={cse:lowRange?}&highRange={cse:highRange?}&searchType={cse:searchType?}&fileType={cse:fileType?}&rights={cse:rights?}&imgsz={cse:imgsz?}&imgtype={cse:imgtype?}&imgc={cse:imgc?}&imgcolor={cse:imgcolor?}&alt=atom"/>
<opensearch:Query role="request" title="Google Custom Search - Albertus Magnus College. library Albertus Magnus College Library intitle:newsletter albertus.edu" totalResults="7" searchTerms=" Albertus Magnus College. library Albertus Magnus College Library intitle:newsletter albertus.edu" count="7" startIndex="1" inputEncoding="utf8" outputEncoding="utf8" cse:safe="off" cse:cx="008033228147187897025:-ua_scxr1uc"/>
<opensearch:totalResults>7</opensearch:totalResults>
<opensearch:startIndex>1</opensearch:startIndex>
<cse:context title="Library Website Search Engine"/>
<cse:searchInformation>
<cse:searchTime>0.073074</cse:searchTime>
<cse:formattedSearchTime>0.07</cse:formattedSearchTime>
<cse:totalResults>7</cse:totalResults>
<cse:formattedTotalResults>7</cse:formattedTotalResults>
</cse:searchInformation>
<cse:spelling>
<cse:correctedQuery type="html"/>
</cse:spelling>
<entry gd:kind="customsearch#result">
<id>http://www.albertus.edu/policy-reports/advancement-publications/documents/albertus-archive-october-2011-special-edition.pdf</id>
<updated>1970-01-16T11:10:30.455Z</updated>
<title type="html">Special Edition Athletics @lbertus <b>Newsletter</b></title>
<link href="http://www.albertus.edu/policy-reports/advancement-publications/documents/albertus-archive-october-2011-special-edition.pdf" title="www.albertus.edu"/>
<summary type="html">This weekend marks a busy and historic time on campus for the <b>Albertus</b>. <br> <b>Magnus College</b> Athletics Department as both the men&#39;s and women&#39;s soccer <b>...</b></summary>
<cse:cacheId>AJGUZgC9CVMJ</cse:cacheId>
<cse:mime>application/pdf</cse:mime>
<cse:fileFormat>PDF/Adobe Acrobat</cse:fileFormat>
<cse:formattedUrl type="html">www.<b>albertus.edu</b>/.../<b>albertus</b>-archive-october-2011-special-edition.pdf</cse:formattedUrl>
<cse:PageMap>
<cse:DataObject type="metatags">
<cse:Attribute name="creationdate" value="D:20111118135759-05'00'"/>
<cse:Attribute name="producer" value="Acrobat Web Capture 8.0"/>
<cse:Attribute name="moddate" value="D:20111118140743-05'00'"/>
<cse:Attribute name="title" value="Special Edition Athletics @lbertus Newsletter"/>
</cse:DataObject>
</cse:PageMap>
</entry>
...
</feed>
I'm using the Generate Extract operator. I've specified the namespaces as:
<list key="namespaces">
<parameter key="x" value="http://www.kbcafe.com/rss/atom.xsd.xml"/>
<parameter key="xmlns:cse" value="http://schemas.google.com/cseapi/2010"/>
<parameter key="xmlns:gd" value="http://schemas.google.com/g/2005"/>
<parameter key="xmlns:opensearch" value="http://a9.com/-/spec/opensearch/1.1/"/>
<parameter key="xx" value="xml"/>
</list>
I've tried to extract XPaths such as
//x:feed
//feed
and more specific ones, but I can't seem to match anything in this feed. I'm sure the problem is in my namespaces, but I don't know where to go to find the answer.
The targets I want to extract are
//x:feed/x:entry/x:title
and //x:feed/x:entry/x:link/@href.
Any help would be appreciated.
Answers
How are you trying to extract XPaths? Your current process setup and maybe some sample data would be useful to write a well-founded answer.
Best,
Marius
My latest attempt was to take the import statements out of both the GooglePage attribute (see the Replace operator) and the .xsd. So the xsd looks like this: I wasn't able to follow the import xsd links from the Google output in my browser, which is why I decided to try to dispense with them.
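Please have a look at the attached process. The trick is to prepend "atom:" to the element names, e.g. //atom:entry instead of //entry, and to define the atom prefix in the namespaces parameter with the namespace URI exactly as it is written in the XML data (here http://www.w3.org/2005/Atom, the default xmlns declared on the feed element).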
Best,
Marius
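For anyone who wants to check the namespace behaviour outside RapidMiner, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is not the attached process, only an illustration of the same rule: the prefix name is arbitrary, but the URI bound to it must equal the feed's default xmlns. The SAMPLE string is a trimmed-down version of the feed posted above, and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the feed from the question (assumption: one entry is enough).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:cse="http://schemas.google.com/cseapi/2010">
  <title>Google Custom Search</title>
  <entry>
    <title type="html">Special Edition Athletics @lbertus Newsletter</title>
    <link href="http://www.albertus.edu/policy-reports/advancement-publications/documents/albertus-archive-october-2011-special-edition.pdf" title="www.albertus.edu"/>
    <cse:mime>application/pdf</cse:mime>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)  # root is the <feed> element

# 1) No prefix: the elements live in the Atom namespace, so a bare name finds nothing.
unprefixed = root.findall(".//entry")

# 2) Prefix bound to the wrong URI (as in the question): still nothing.
wrong = root.findall("x:entry", {"x": "http://www.kbcafe.com/rss/atom.xsd.xml"})

# 3) Prefix bound to the URI exactly as written in the feed's xmlns attribute.
NS = {"atom": "http://www.w3.org/2005/Atom"}
titles = [t.text for t in root.findall("atom:entry/atom:title", NS)]
hrefs = [link.get("href") for link in root.findall("atom:entry/atom:link", NS)]

print(unprefixed, wrong)
print(titles)
print(hrefs)
```

The first two lookups come back empty, which matches what happened with //feed and //x:feed in the question. Only the third one, with the prefix mapped to http://www.w3.org/2005/Atom, reaches the entry titles and link hrefs.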