This is the report of Day 3 of the IPTC Autumn 2018 Meeting in Toronto. See the report from Day 1 and the report from Day 2. All the presentations are available to IPTC members in the IPTC Members Only Zone.
Day 3 of IPTC Autumn Meetings always includes the Annual General Meeting, where all Voting Members can have their say in the future of the organisation. This time new Managing Director Brendan Quinn gave his first MD’s report, alongside Stuart Myles’ Chairman’s Report (which will be posted to the IPTC blog soon). Materials from the AGM are available to members in the IPTC Members Only Zone.
Rounding out the discussions for the three days, we had some broad-ranging and future-facing conversations regarding News Credibility projects, where Stuart Myles took us on a tour of the wide range of projects and initiatives around misinformation, the credibility of news and news sources, and the perceived problems of “fake news.” IPTC or its members are helping several organisations with their efforts in this area, such as the W3C Credible Web community group and the Journalism Trust Initiative.
We also had a discussion on funding opportunities and potential IPTC projects, which is an internal discussion involving members only.
Lastly, speaking about the future, we had Michael Young from Civil Media speak to us about their plans to use blockchain technologies to power small newsrooms and fulfil their broad goal to “power sustainable journalism throughout the world.” A lot of focus has been on Civil’s Initial Coin Offering, which closed underfunded and will be returning investors’ money, but they have many other activities, including a suite of WordPress-based plugins allowing news providers to join the Civil ecosystem and pledge openness, fairness and transparency according to the Civil Foundation’s constitution. Mike explained how blockchain-based voting and decisions mean that members can be rewarded for pointing out breaches of the constitution, and bad actors can be punished or even removed from the network entirely.
The event ended with a few of us attending the Canadian Journalism Foundation’s event with journalism pundits Vivian Schiller, Jeff Jarvis, Jay Rosen and Matthew Ingram, talking about misinformation and misuse of social media (video recording available via the above link), and ten of us went on a networking and team bonding trip to Niagara Falls and to a local winery on the Thursday.
Overall it was a great Autumn Meeting which set the scene and built the foundation for many more great IPTC meetings to come!
This is the report of Day 2 of the IPTC Autumn 2018 Meeting in Toronto. See the report from Day 1 and the report from Day 3. All the presentations are available to IPTC members in the IPTC Members Only Zone.
Day 2 of the IPTC Autumn 2018 Meeting in Toronto was a deep dive into search and classification. Many of our members are working hard to make their content accessible quickly and easily to their customers, and user expectations are higher than ever, so search is a key part of what they do.
First up we had Diego Ceccarelli from Bloomberg talking through their search architecture. Users of Bloomberg terminals have very high expectations that they will see stories straight away: They have 16m queries and 2m new stories and news items per day, with requirements for a median query response time of less than 200ms and for new items to be available in search results in less than 100ms. And as Diego says, “with huge flexibility comes huge complexity.” For example, because customers expect to see the freshest content straight away, the system has no caching at all!
To achieve this, the Bloomberg team use Apache Solr – in fact they have 3 members of staff dedicated to working on Solr full-time, and have contributed a huge amount of code back to the project, including their machine-learning-based “learning to rank” module which can be trained to rank a set of search results in a nuanced way. Bloomberg also worked with an agency to develop open source code used to monitor a stream of incoming stories against queries, used for alerting. Other topics Diego raised included clustering of search results, balancing relevance and timeliness, crowdsourcing data to train ranking systems, combining permissions into search results, and more – a great talk!
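To make the alerting idea concrete, here is a minimal sketch of matching each incoming story against a set of stored queries. It is not Bloomberg's code and does not use Solr: the `StoredQuery` class, the `matching_alerts` function and the sample queries are all hypothetical names invented for illustration, and real systems use far richer query languages and indexing tricks to do this at scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredQuery:
    """A saved alert (hypothetical): fires when every term appears in a story."""
    name: str
    terms: frozenset


def tokenize(text):
    """Lowercase the text, split on whitespace and trim surrounding punctuation."""
    return {word.strip(".,!?;:\"'()").lower() for word in text.split()} - {""}


def matching_alerts(story, queries):
    """Return the names of all stored queries whose terms all occur in the story."""
    words = tokenize(story)
    return [q.name for q in queries if q.terms <= words]


# Illustrative stored queries, not taken from any real system.
queries = [
    StoredQuery("oil-prices", frozenset({"oil", "prices"})),
    StoredQuery("toronto-tech", frozenset({"toronto", "startup"})),
]

alerts = matching_alerts("Oil prices rise as Toronto markets open.", queries)
print(alerts)
```

The point of the design is that the usual search direction is reversed: the queries are stored and the documents stream past them, so each new story is checked once on arrival rather than being polled for by every user.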
Our heads already reeling with all the information we learned from Bloomberg, we then heard from another search legend, Boerge Svingen, one of the founders of FAST Search in Norway and now Director of Engineering at the New York Times. He spoke about how NYT re-architected their search platform to be based around Apache Kafka, a “distributed log streaming” platform that keeps a record of every article ever published on the Times (since 1851!) and can replay all of them to feed a new search node in around half an hour. The platform is so successful that it is used to feed the “headless CMS” (see yesterday’s report) based on GraphQL which is used to render pages on nytimes.com for all types of devices. Boerge and his team use Protocol Buffers as their schema to keep everything light and fast. More information in Boerge’s slide deck, available to IPTC members.
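The replay idea can be illustrated with a toy example. This is only a sketch under simplifying assumptions: it uses an in-memory list rather than Kafka, and `ArticleLog` and `SearchNode` are hypothetical classes made up for this post, not part of the NYT platform or of any Kafka client library.

```python
from collections import defaultdict


class ArticleLog:
    """Append-only log (in-memory stand-in): entries are only ever added."""

    def __init__(self):
        self._entries = []

    def append(self, article_id, text):
        """Add an article and return the offset at which it was stored."""
        self._entries.append((article_id, text))
        return len(self._entries) - 1

    def replay(self, from_offset=0):
        """Yield entries in publication order, starting at the given offset."""
        yield from self._entries[from_offset:]


class SearchNode:
    """Toy inverted index that is built purely by consuming the log."""

    def __init__(self):
        self._index = defaultdict(set)

    def consume(self, log, from_offset=0):
        """Index every article found in the log from the given offset onwards."""
        for article_id, text in log.replay(from_offset):
            for word in text.lower().split():
                self._index[word].add(article_id)

    def search(self, word):
        """Return the sorted ids of articles containing the word."""
        return sorted(self._index.get(word.lower(), set()))


log = ArticleLog()
log.append("a1", "Election results announced")
log.append("a2", "Election night coverage")

# A brand-new node needs no special bootstrap: it just replays from offset 0.
node = SearchNode()
node.consume(log)
print(node.search("election"))
```

Because the log is the single source of truth, adding a search node, or any other consumer, is just a matter of reading the same sequence from the beginning.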
Next up was Chad Schorr talking about search at Associated Press, discussing their Elastic implementation on Amazon Web Services. Using a devops approach based on “immutable infrastructure” meant that the architecture is now very solid and well-tested. Chad was very open and spoke about issues and problems AP had while they were implementing the project and we had a great discussion about how other organisations can avoid the same problems.
Then Robert Schmidt-Nia from DPA talked about their implementation of a content repository (in effect another “headless CMS”!) based on serialising NewsML-G2 into JSON using a serverless architecture based on AWS Lambda functions, Amazon S3 for storage, SQS queues and Elasticsearch. Robert told of how the entire project was built in three months with one and a half developers, and ended up with only 500 lines of code! It can now be used to provide services to DPA customers that could not be provided before, including subsets of content based on metadata such as all Olympics content.
Next, Solveig Vikene and Roger Bystrøm from Norway’s news agency NTB spoke about and gave a live demo of their new image archive search product. They demonstrated how photographers can pre-enter metadata so that they can send their photos to the wire a few seconds after taking them on the camera. Some functions like global metadata search and replace and a feature-rich query builder made their system look very impressive.
Veronika Zielinska from Associated Press spoke about AP’s rule-based text classification systems, showing the complexity of auto-tagging content (down to disambiguating between two US Republican Congressmen both called Mike Rogers!) and the subtlety of AP’s terms (distinguishing between “violent crime” events and the social issue of “domestic violence”), and therefore the necessity of manually creating, and maintaining, a rules-based system.
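As a rough illustration of what a rules-based tagger involves, here is a minimal sketch. It is not AP's system: the `Rule` structure, the tag names and the trigger phrases are all assumptions invented for this example, and the naive substring matching would be far too crude for production use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a tag plus the phrases that trigger or block it."""
    tag: str
    all_of: tuple = ()   # every phrase must appear
    any_of: tuple = ()   # if given, at least one phrase must appear
    none_of: tuple = ()  # none of these phrases may appear


# Illustrative rules only; real rule sets are much larger and hand-maintained.
RULES = [
    Rule("person:mike-rogers-michigan", all_of=("mike rogers",), any_of=("michigan",)),
    Rule("person:mike-rogers-alabama", all_of=("mike rogers",), any_of=("alabama",)),
    Rule("event:violent-crime", any_of=("assault", "armed robbery", "shooting"),
         none_of=("domestic violence",)),
    Rule("issue:domestic-violence", all_of=("domestic violence",)),
]


def classify(text, rules=RULES):
    """Return the tags of every rule satisfied by the text (substring match)."""
    lowered = text.lower()
    tags = []
    for rule in rules:
        if not all(phrase in lowered for phrase in rule.all_of):
            continue
        if rule.any_of and not any(phrase in lowered for phrase in rule.any_of):
            continue
        if any(phrase in lowered for phrase in rule.none_of):
            continue
        tags.append(rule.tag)
    return tags


print(classify("Rep. Mike Rogers of Alabama spoke on Tuesday."))
```

Even in this toy form, the name alone is not enough to pick a tag: context such as the state has to be written into the rule, which hints at why such systems need ongoing manual maintenance.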
Stuart Myles then took us on a tour through AP’s automated image classification activities, looking at whether commercial tools are yet up to the task of classifying news content, the value of assembling good training sets but the difficulties in doing so, and the benefits of starting with a relatively small taxonomy that is easier for machine learning systems to understand.
Dave Compton talked us through Thomson Reuters Knowledge Items used by the OpenCalais classifier and how they use the PermID system to unify concepts across their databases of people, organisations, financial instruments and much more. Dave described how Knowledge Items are represented as NewsML-G2 Knowledge Items, and are mapped to Media Topics where possible.
On that subject, Jennifer Parrucci of the New York Times, and chair of the IPTC NewsCodes Working Group, gave an update on the latest activities of the group, including the ongoing Media Topic definitions review, adding new Media Topic terms after suggestions by the Swedish media industry, and work with the schema.org team on mapping between schema.org and Media Topics terms.
As you can see, it was a very busy day!
- All XML Schemas plus full documentation (about 60 MB) from https://www.iptc.org/std/NewsML-G2/NewsML-G2_2.28.zip
- The same without XML Schema documentation in HTML (about 3 MB) from https://www.iptc.org/std/NewsML-G2/NewsML-G2_2.28-noXMLdocu.zip
- From the newsml-g2 repository on GitHub as a Release: https://github.com/iptc/newsml-g2
Please note that the XML examples have been temporarily removed as we have not yet updated them to 2.28. The pack will be updated when the examples are brought up to date.
Update on 6 November: examples have now been updated to 2.28 and are now available on the above links. Enjoy!
Details of the changes made in version 2.28 can be found on http://dev.iptc.org/G2-Approved-Changes.
In summary the changes are:
- Add new element derivedFromValue. Previously we could say that elements were derived from a concept using the derivedFrom element. But if a system creates a new property based on another existing property, such as a slugline, there was no way of representing it.
- Add a new element metadataCreator to itemMeta. This allows us to represent NewsML-G2 items that have had metadata created by a third-party person or system, without having to specify the creator on each metadata property individually.
The NewsML-G2 Implementation Guidelines are available at https://www.iptc.org/std/NewsML-G2/guidelines.
Note on Power and Core Conformance Levels
As a reminder of an important decision taken for NewsML-G2 version 2.25, which also applies to version 2.28: the Core Conformance Level will not be developed any further, as all recent Change Requests were in fact aimed at features of the Power Conformance Level; changes to the Core Level were only a side effect.
The Core Conformance Level specifications of version 2.24 will stay available and valid. Find them at http://dev.iptc.org/G2-Standards#CCLspecs
This is the report of Day 1 of the IPTC Autumn 2018 Meeting in Toronto. See the report from Day 2 and the report from Day 3. All the presentations are available to IPTC members in the IPTC Members Only Zone.
This week we are in Toronto for the IPTC Autumn Meeting. Unfortunately the weather is not as warm as it was last week but we are still enjoying ourselves immensely and learning a lot from each other!
All presentations are available to members on the members-only event page.
After an introduction from Chair Stuart Myles, we heard an update from Michael Steidl, chair of the Video Metadata and Photo Metadata Working Groups. Michael updated us on work promoting the IPTC Video Metadata Hub standard, talking to manufacturers and software vendors at events like IBC in Amsterdam, and pulling together use cases and success stories from existing users of the standard.
On the IPTC Photo Metadata Standard, Michael shared the news that Google Images now displays IPTC Photo Metadata, and the press coverage we have received since then. We are also working on new technical features in the standard, such as metadata for regions within images. We’re looking for use cases and requirements for storing metadata against regions, so if you have any input, please let Michael, or IPTC Managing Director Brendan Quinn, know!
Dave Compton of Refinitiv, formerly the Financial & Risk business of Thomson Reuters, chair of the NewsML-G2 Working Group, gave an update on recent progress and work towards NewsML-G2 version 2.28 which will be released soon. It will incorporate features for the requirements of auto-tagging systems and a new experimental namespace to be used for potential new updates to NewsML-G2 that aren’t yet ready to be added to the full specification.
The experimental extension to NewsML-G2 is already in use by Gerald Innerwinkler of APA and Robert Schmidt-Nia of DPA, who presented an update on a current project between IPTC and MINDS International looking at metadata for suggesting news stories to users based on psychological and emotional characteristics, plus properties like the likely timeliness for different types of user. Based on the Limbic Map concept from marketing theory, the new proposals are in testing right now.
Chair of the Sports Content Working Group, Johan Lindgren of TT in Sweden, presented an update on SportsML and the work on SportsJS, which is nearing a final version now that JSON Schema will soon support some new properties that we need in order to validate sports content.
Stuart Myles appeared again in his role as chair of the Rights Working Group, updating us on RightsML and where we can take it in the future, including the potential to use RightsML as the basis of blockchain-based rights management systems.
Then we had a focus on “new-generation editorial systems” including a great presentation from Peter Marsh of new IPTC member NEWSCYCLE Solutions on the history and state of the art of content management systems from Tandem-based SII workstations in the 1980s, all the way through to the current wave of headless CMSs as illustrated by this project by The Economist.
Stephane Guerrilot of AFP finished day one presenting AFP’s new-generation system, Iris, which enables AFP customers and partners to search for stories, video and images.
Stay tuned for a report on Day Two!