How to Know What You Know: 5-Minute Interview

"I want to know what I know. That describes what knowledge graphs do for companies," said Dr. Alessandro Negro, Chief Scientist at GraphAware.

In this week’s five-minute interview, we discuss how GraphAware uses natural language processing to help companies gain a better understanding of the knowledge that is spread across their organization.

Original Link

Apache NiFi + Apache OpenNLP With Organizations and Flow Files

Updating the Apache OpenNLP Community Apache NiFi Processor to Support Flow Files

In this new release, we add the ability to read content from the FlowFile and analyze it for Locations, Dates, Organizations, and Names. We are using the Apache OpenNLP 1.5 models that are available for download. These do a decent job, and you can build new models as needed. I also changed the processor to output one attribute per type, containing a String list of locations, organizations, dates, and names.
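For reference, the core of the extraction step is OpenNLP's name finder API. Here is a minimal standalone sketch of that flow (the model path and sample text are placeholders, not taken from the processor itself):

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

public class NerSketch {
    public static void main(String[] args) throws Exception {
        // Load one of the pre-trained 1.5 models (the path is an assumption).
        try (InputStream modelIn = new FileInputStream("models/en-ner-location.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            // In the processor, this text would come from the FlowFile content.
            String text = "Apache NiFi was presented in Barcelona on May 10 2018 .";
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(text);

            // Find entity spans and convert them back to plain strings.
            Span[] spans = finder.find(tokens);
            List<String> locations = Arrays.asList(Span.spansToStrings(spans, tokens));

            // The processor joins values like these into one comma-delimited attribute per type.
            System.out.println(String.join(",", locations));
        }
    }
}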

I put out a new release, built around Apache NiFi 1.6.0.

Source and NAR Download

You can check out the source code on GitHub.

Download the pre-trained models for your language here

I chose English (en).

In a future release, I may add Organization, Money, Time, and Percentage to the lists we extract if there is interest.

A Final JSON File Produced

{"created_at":"Thu May 10 16:55:17 +0000 2018","id":994621913115840512,"id_str":"994621913115840512","text":"Inflated 3D Convnet or I3D model trained for action recognition on kinetics-400. https:\/\/t.co\/4Udj1jTSVp","source":"\u003ca href=\"https:\/\/ifttt.com\" rel=\"nofollow\"\u003eIFTTT\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2496666240,"id_str":"2496666240","name":"Brent Arnichec","screen_name":"luckflow","location":"San Francisco, CA","url":"http:\/\/emulai.com","description":"#ArtificialIntelligence #MachineLearning #DeepLearning #IoT #fintech #Bigdata #Technology #Science #Robotics #DL #tech #Blockchain #Computing #AI","translator_type":"none","protected":false,"verified":false,"followers_count":146,"friends_count":711,"listed_count":14,"favourites_count":1,"statuses_count":822,"created_at":"Thu May 15 16:21:13 +0000 2014","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"E81C4F","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/878328407003496450\/i2Ii4dAz_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/878328407003496450\/i2Ii4dAz_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2496666240\/1498327723","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/4Udj1jTSVp","expanded_url":"https:\/\/www.tensorflow.org\/hub\/modules\/deepmind\/i3d-kinetics-400\/1","display_url":"tensorflow.org\/hub\/modules\/de\u2026","indices":[81,104]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1525971317925"}

Example Output

The Main Flow for Trying Out the NLP Processor

Set Your Models

New NLP Processor Documentation

Here is the schema to use to process this data. Note that nlp_names is a String of comma-delimited values. You may want to parse these fields or do additional processing on them.
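If you need the individual values downstream, splitting such an attribute is straightforward. A small sketch (the sample attribute value below is made up, and the trimming behavior is an assumption):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NlpAttributeParser {

    // Split a comma-delimited attribute value such as nlp_names back into
    // individual entity strings, trimming surrounding whitespace.
    static List<String> splitEntities(String attributeValue) {
        if (attributeValue == null || attributeValue.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(attributeValue.trim().split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        // Hypothetical value of the nlp_names attribute
        System.out.println(splitEntities("Alessandro Negro, Ravi Shankar"));
    }
}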

High-Level Flow

Original Link

Transmuting Documents Into Graphs

Alchemy is a philosophical and proto-scientific tradition practiced throughout Europe, Africa, and Asia. Its aim is to purify, mature, and perfect certain objects. In popular culture, we often see shadowy figures trying to turn lead into gold to make themselves immensely rich or to ruin the world economy. In our case, we will not be transmuting lead into gold, but documents into graphs, which is just as good. In the past, we used the “Alchemy API,” but it was purchased by IBM and retired. You can get similar functionality with IBM Watson, but let’s do something else instead. Let’s add Entity Extraction right into Neo4j.

The concept is to take a document, be it a text file, Word document, PDF, PowerPoint, Excel spreadsheet, etc., have Tika detect and extract the metadata and text, then run that text through a set of NLP models from OpenNLP to find interesting entities. Let’s go ahead and build a stored procedure.

If I think I may have too much in the stored procedure itself, I sometimes just make a callable and stream the results. That’s all we are doing here. There is one big spoiler here already: we are going to ingest documents in more than just English.

@Procedure(name = "com.maxdemarzi.en.ingest", mode = Mode.WRITE)
@Description("CALL com.maxdemarzi.en.ingest")
public Stream<GraphResult> IngestEnglishDocument(@Name("file") String file) throws Exception {
    IngestDocumentCallable callable = new IngestDocumentCallable(file, "English", db, log);
    return Stream.of(callable.call());
}

Our procedure is going to return a GraphResult, so we will need a place to hold our nodes and relationships as we find or create them:

@Override
public GraphResult call() {
    List<Node> nodes = new ArrayList<>();
    List<Relationship> relationships = new ArrayList<>();

We don’t know what the file type of the document is going to be, so we will use an AutoDetectParser to deal with it, per these instructions.

BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();

Next, we will parse and capture the text of the document:

String text = "";
try (InputStream stream = new FileInputStream(new File(file))) {
    parser.parse(stream, handler, metadata);
    text = handler.toString();
} catch (Exception e) {
    log.error(e.getMessage());
}

With our text in hand, we can use an OpenNLPNERecogniser to find interesting entities and store them in a map. I’m choosing between two languages here but, of course, you could add more.

Map<String, Set<String>> recognized;
switch (language) {
    case "English":
        recognized = ENGLISH_NER.recognise(text);
        break;
    case "Spanish":
        recognized = SPANISH_NER.recognise(text);
        break;
    default:
        recognized = new HashMap<>();
}

So wait, what are ENGLISH_NER and SPANISH_NER anyway? They are OpenNLPNERecogniser objects that take as input a map pointing them to the locations of the pre-trained model files for each language. You can find more pre-trained language models on the web, or you can train your own.

private static final Map<String, String> ENGLISH = new HashMap<String, String>() {{
    put(PERSON, "models/en-ner-person.bin");
    put(LOCATION, "models/en-ner-location.bin");
    put(ORGANIZATION, "models/en-ner-organization.bin");
    put(TIME, "models/en-ner-time.bin");
    put(DATE, "models/en-ner-date.bin");
    put(PERCENT, "models/en-ner-percentage.bin");
    put(MONEY, "models/en-ner-money.bin");
}};

static final OpenNLPNERecogniser ENGLISH_NER = new OpenNLPNERecogniser(ENGLISH);

With that out of the way, let’s get back to our procedure. We first go ahead and create a new document node with our text, file name, and language. Then we add it to our list of nodes to return in our result set.

try (Transaction tx = db.beginTx()) {
    Node document = db.createNode(Labels.Document);
    document.setProperty("text", text);
    document.setProperty("file", file);
    document.setProperty("language", language);
    nodes.add(document);

Then, for every type of entity our language model recognized, we check whether the entity already exists or create it, and then add a relationship from the document to that entity. We add our entities and relationships to our result sets as we go, and finally we call success to make sure our transaction gets committed.

    for (Map.Entry<String, Set<String>> entry : recognized.entrySet()) {
        Label label = Schema.LABELS.get(entry.getKey());
        for (String value : entry.getValue()) {
            Node entity = db.findNode(label, "id", value);
            if (entity == null) {
                entity = db.createNode(label);
                entity.setProperty("id", value);
            }
            nodes.add(entity);
            Relationship has = document.createRelationshipTo(entity, RelationshipTypes.HAS);
            relationships.add(has);
        }
    }
    tx.success();

Let’s compile our procedure, add it to the plugins folder of Neo4j, restart Neo4j, and try a few documents. But first, let’s go ahead and create some indexes on the “id” property for each of our entity types:

CALL com.maxdemarzi.schema.generate;
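The generate procedure itself isn’t shown in this post. Assuming it simply builds one index per entity label on “id” through the embedded API (reusing the db handle and the Schema.LABELS map from the procedure code above), it might look roughly like this:

// Rough sketch only: create an index on "id" for each entity label.
try (Transaction tx = db.beginTx()) {
    for (Label label : Schema.LABELS.values()) {
        db.schema().indexFor(label).on("id").create();
    }
    tx.success();
}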

Now we can try one by calling:

CALL com.maxdemarzi.en.ingest('data/en_sample.txt');

Cool, it works! How about a PDF file instead of a text file:

CALL com.maxdemarzi.en.ingest('data/en_sample.pdf');

Nice. How about a Spanish PDF:

CALL com.maxdemarzi.es.ingest('data/es_sample.pdf');

Sweet! It looks like we could use a little disambiguation and some cleanup, but this is enough to get us started. The source code as always is on GitHub. Be sure to take a look at the Tika documentation to learn more about what it can do.

Finally, if you want to see a plugin with more NLP functionality, check out the work our partners at GraphAware have done with their Neo4j NLP plugin.

Original Link

How to Tackle Big Data With Natural Language Processing

Natural language processing (NLP) is an exciting frontier of research that products such as Siri, Alexa, and Google Home have tapped into to bring a new level of interaction to their users. To use NLP effectively, we must look at how this type of processing can help us, what we intend to gain from it, and how we get from raw data to the final product. If you’re only just beginning to look at NLP, it can be overwhelming, but by breaking the process down into more manageable parts, we can navigate the topic with ease.

Starting With the Basics

The basic processing we’re looking at is how to turn regular, everyday text into something a computer can understand. From it, we can extract things like jargon, slang, and even the speaking style of the writer. This processing takes the Unicode characters and separates them into words, phrases, sentences, and other linguistic units using techniques such as tokenization, decompounding, and lemmatization. Using all of these strategies, we can start to pick apart the language and even determine which language it is from the words, spelling, and punctuation present. Before we can build the language up for use, we must first break it down and analyze its component parts so we can understand how it works.
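As a concrete illustration of that first step, here is a minimal tokenization sketch. The article doesn’t prescribe a library, so the choice of OpenNLP’s SimpleTokenizer here is just one convenient option:

import opennlp.tools.tokenize.SimpleTokenizer;

public class TokenizeSketch {
    public static void main(String[] args) {
        // Break raw text into word-level tokens; later stages such as
        // lemmatization, decompounding, and language detection build on
        // this token stream.
        String text = "Siri, Alexa and Google Home all rely on NLP pipelines.";
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(text);
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}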

Figuring Out the Scope

Looking at a large block of text can make it difficult to determine what exactly the text is about, even for a human. Do we need to know the general gist of the text, or is it more prudent to figure out what’s being said within the text body itself? This is the difference between macro understanding and micro understanding. NLP is constrained by cost and time, and some levels of processing are simply not practical because of those constraints. Once we have an idea of what scope we’re aiming for, we can move on to extraction.

Extraction of Content for Processing

Macro understanding gives us the general gist of the document we’re processing. We can use it for classification, topic extraction, summarization of legal documents, semantic search, duplicate detection, and keyword or key phrase extraction. If we’re looking for micro understanding, we can use processing to read deeper into the text itself and extract acronyms and their meanings or the proper names of people or companies. In micro understanding, word order is extremely important and must be preserved.

Back Trace Availability

Once we’ve extracted data from a particular document, we’ll want to ensure that we know where that data came from. Having a link back to the source document can save a lot of time in the long run. This tracing helps track down possible errors in the text, and if a source document gets updated to a newer version, the changes can be reflected in the extracted information with a minimum of reprocessing, saving time and processing power.

Human Feedback

The best way to make an NLP system adapt is to teach it to listen to feedback from the people who created the language in the first place: humans themselves. Feedback about how an NLP system performs should be used to adapt it to what we want it to do.

Keeping Ahead of the Curve

Constant quality analysis is crucial to ensuring that an NLP system fulfills its role and adapts to the world around it. Building an NLP system is essentially teaching a computer how to learn from its mistakes and how to gather feedback to improve itself. By itself, big data is daunting and repetitive and can have a lot of insight buried inside it. By developing an NLP system, you give a computer a task it is well suited to, while at the same time teaching it to think like a human in its extraction process. It’s the best of both worlds.

Original Link

Solving a Polyglot Error in Python

I wanted to use the polyglot NLP library that my colleague Will Lyon mentioned in his analysis of Russian Twitter trolls, but had installation problems that I thought I’d share in case anyone else experiences the same issues.

I started by trying to install polyglot:

$ pip install polyglot
ImportError: No module named 'icu'

Hmmm, I’m not sure what icu is, but luckily there’s a GitHub issue covering this problem. That led me to Toby Fleming’s blog post that suggests the following steps:

brew install icu4c
export ICU_VERSION=58
export PYICU_INCLUDES=/usr/local/Cellar/icu4c/58.2/include
export PYICU_LFLAGS=-L/usr/local/Cellar/icu4c/58.2/lib
pip install pyicu

I already had icu4c installed, so I just had to make sure that I had the same version of that library as Toby did. I ran the following command to check that:

$ ls -lh /usr/local/Cellar/icu4c/
total 0
drwxr-xr-x 12 markneedham admin 408B 28 Nov 06:12 58.2

That still wasn’t enough, though! I had to install these two libraries, as well:

pip install pycld2
pip install morfessor

I was then able to install polyglot, but I still had to run the following commands to download the files needed for entity extraction:

polyglot download embeddings2.de
polyglot download ner2.de
polyglot download embeddings2.en
polyglot download ner2.en

And that’s all!

Original Link

Active.ai, developer of chatbots for banks, raises over $8m in Series A round

Microsoft Accelerator Summer Cohort 2017

The summer 2017 cohort of Microsoft’s India accelerator in Bengaluru. Photo credit: Microsoft Accelerator.

Active.ai has raised US$8.25 million in a series A round co-led by Vertex Ventures, CreditEase, and Dream Incubator, it announced today.

Existing investors IDG Ventures India and Kalaari Capital also participated in the funding.

Vani Kola, managing director at Kalaari Capital and a board member at Active.ai, said in a statement that the capital will fuel the startup’s continued R&D efforts, as well as its plans for expansion to other countries.

Active.ai co-founder and CEO Ravi Shankar added that the investment will also be used to grow the company’s technical team.

Headquartered in Singapore with an R&D base in Bengaluru, Active.ai is one of a large number of startups in the region that have developed customer service chatbots.

In its case, the focus is on serving banks and financial institutions. Active.ai’s chatbots are powered by its artificial intelligence (AI) engine named Triniti, which is capable of natural language processing and generation.

The platform allows banks to respond to queries or complaints sent by customers through messaging apps such as Facebook Messenger and Line. A Triniti-powered chatbot can then engage in a conversation with those customers to try to resolve their queries – saving the bank time, money, and manpower, and allowing it to deploy its human employees to higher-level tasks.

Beyond chatbots, the startup is also aiming to apply its AI engine to channels such as voice calls, text messaging, and virtual reality, so that banks can automate more of their customer-facing operations.

Active.ai’s last funding round was a US$3 million investment from Kalaari and IDG about a year ago. The startup was selected to join Microsoft’s India accelerator program earlier this year.

Editing by Michael Tegos

Original Link

The Secret to Getting Data Lake Insight: Data Quality

More and more companies around the globe are realizing that big data and deeper analytics can help improve their revenue and profitability. As such, they are building data lakes using new big data technologies and tools, so they can answer questions such as, How do we increase production while maintaining costs? How do we improve customer intimacy and share of wallet? What new business opportunities should we pursue? Big data is playing a major role in digital transformation projects; however, companies that do not have trusted data at the heart of their operations will not realize the full benefits of their efforts.

Instituting Sustainable Data Quality and Governance Measures

If big data is to be used, organizations need to make sure that this information collection is under control and held to a high standard. Yet, according to a recent report by KPMG, 56% of CEOs are concerned about the quality of the data on which they base decisions. To improve the trustworthiness of data as it flows through the enterprise, companies need to look at the entire data quality lifecycle, including metadata management, lineage, preparation, cleansing, profiling, stewardship, privacy, and security.

A few weeks ago, Gartner released the 2017 Gartner Magic Quadrant for Data Quality Tools — a report that reviews the data quality lifecycle and showcases innovative technologies designed to “meet the needs of end-user organizations in the next 12 to 18 months.”

The report highlights the increasing importance of data quality for the success of digital transformation projects, the need to use data quality as a means to reduce costs, and the changing requirements to be a leader. Some of the trends highlighted in the report that speak directly to data lake development and usage include:

  • The need to capture and reconcile metadata.
  • The ability to connect to a wide variety of on-premises and cloud structured and unstructured data sources.
  • The importance of DevOps and integration interoperability in the data quality environment.
  • How business users are now the primary audience and need data quality workflow and issue resolution tools.
  • The increasing requirement for real-time data quality services for low-latency applications.

Machine Learning and Natural Language Processing to the Rescue

As companies ingest large amounts of unstructured and unknown data, it can be a challenge to validate, cleanse, and transform that data quickly enough to avoid delaying real-time decisions and analytics. This does not mean that 100% of the data lake needs to be sanctioned data, as companies will create a data lake partition of “raw data,” which data scientists often prefer for analysis. In addition, raw and diverse data can be provisioned to different roles before enrichment, shifting from a single-version-of-the-truth model to a more open and trustworthy collaborative governance model.

In the past, data quality relied solely on complex algorithms; for example, probabilistic matching for deduplicating and reconciling records. An important trend we are seeing at Talend, and one outlined in the Gartner report, is the use of machine learning with data quality to assist with matching, linking, and merging data. With the sheer volume and variety of data in the data lake, using Hadoop, Spark, and machine learning for data quality processing means faster time to trusted insight. Data science algorithms can quickly sift through gigabytes of data to identify relationships, duplicates, and more. Natural language processing can help reconcile definitions and provide structure to unstructured text, adding insight when combined with structured data.
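As a deliberately simplified illustration of the matching idea, here is a sketch of edit-distance similarity, just one of the many signals a probabilistic or machine-learning matcher would weigh. The threshold and normalization are arbitrary choices for the example:

public class DedupSketch {

    // Classic dynamic-programming edit distance between two strings.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Normalize to a 0..1 similarity and compare against a tunable threshold.
    static boolean likelyDuplicate(String a, String b, double threshold) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return true;
        double similarity = 1.0 - (double) levenshtein(a.toLowerCase(), b.toLowerCase()) / maxLen;
        return similarity >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(likelyDuplicate("Talend Inc.", "Talend, Inc", 0.8)); // true
        System.out.println(likelyDuplicate("Talend", "Gartner", 0.8));          // false
    }
}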

Machine learning can be a game changer because it can capture tacit knowledge from the people who know the data best, then turn this knowledge into algorithms that can be used to automate data processing at scale. Furthermore, through smarter software that uses machine learning and smart semantics, any line-of-business user can become a data curator – making data quality a team sport! For example, tools such as Talend Preparation and Data Stewardship combine a highly interactive, visual, and guided user experience with these features to make data curation easier and data cleansing faster.

Devising a Plan for Agile Data Quality in the Data Lake

Implementing a data quality program for big data can be overwhelming. It is important to come up with an incremental plan and set realistic goals; sometimes getting to 95% is good enough.

  1. Roles: Identify roles, including data stewards and users of data.

  2. Discovery: Understand where data is coming from, where it is going, and what shape it is in. Focus on cleaning your most valuable and most used data first.

  3. Standardization: Validate, cleanse, and transform data. Add metadata early so that data can be found by humans and machines. Identify and protect personal and private organizational data with data masking.

  4. Reconciliation: Verify that data was migrated correctly.

  5. Self-service: Make data quality agile by letting the people who know the data best clean their own data.

  6. Automate: Identify where machine learning in the data quality process can help, such as data deduplication.

  7. Monitor and manage: Get continuous feedback from users and define data quality metrics to drive improvement.

In summary, for companies to get the most out of their digital transformation projects and build an agile data lake, they need to design data quality processes from the start.

*Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Original Link

Line prepares to challenge Amazon’s Alexa and Google Home

Video credit: Line.

A father and his young daughter are preparing to go to sleep. “Lights off,” he says to a glowing smart speaker, next to their bed.

The smart speaker is Line’s Wave – which hit the shelves on October 5.

“Lights on,” their daughter protests. Dad laughs.

“No, lights off.”

“Lights on!”

Later, Mom, who’s on the way home from work, sends a message to the smart speaker: “Slept yet?”

The daughter replies: “Clova, tell Mama that Papa is asleep.”

This is Line’s vision for its smart speaker, Clova Wave. It’s heartwarming, but the question remains: can Wave and its digital assistant, Clova, challenge Alexa or Google Assistant?

Image credit: Line.

Line playing catch up

“Line is only six years old; other companies have more than fifteen years of experience,” Line CTO Park Euivin says through a translator at Line Developer Day 2017.

“But even if we have a late start, I don’t think there is any company that matches us in speed,” she says.

Within a year, Line had created Clova, which stands for “cloud-based virtual assistant.”

The accompanying hardware, Wave, opened for pre-orders in July this year. Line Developer Day 2017 marks the official release of the smart speaker.

Wave’s specs at Line Developer Day 2017. Photo credits: Tech In Asia.

Park points out that Line’s parent company, Naver, owns the dominant search engine in Korea. As with Google and its device, there’s potential for Wave to work with Naver’s search data.

Also, with key offices in Taiwan, Thailand, Indonesia, and Korea, Line is likely to have access to language specialists in these countries, whether in-house hires or external partners.

On top of that, Line reported a monthly active user count of 169 million across four key countries at Line Developer Day. The wealth of user data available means localization opportunities for the home speaker.

“I don’t think anyone has an advantage [in the smart home market],” says Park. “We have data about Asian users and we can leverage that.”

Line says they’re the first to launch a Japanese smart speaker powered by AI.

Apart from collaborations with big players like Sony and LG, Line is also open to partnering with any company that can make its users’ lives better. Its current partners range from convenience store Family Mart to holographic home robot manufacturer Gatebox.

Video credit: Line.

A long road

Line faces formidable foes in Google and Amazon. Google has several ways to get user data and lock users in. Google Chrome is the dominant web browser. Gmail has more than one billion monthly active users. Over 85 percent of the world’s web searches came from Google in July 2017.

When asked about collaboration with Google specifically, Taiichi Hashimoto, who oversees the Clova project, says Line wouldn’t rule out the possibility.

Google Home shipments are expected to reach several million units in 2017. Meanwhile, Amazon Echo has surpassed that: shipments are expected to exceed 10 million units in 2017.

It’ll be an uphill battle for Line, whose team apologized for shipment delays of the trial version of Wave released in August.

But Line’s main goal for Clova and Wave in 2017 is not user acquisition. Wave appears to be just a side dish to complement Clova: Line is focused on creating API openings, growing the Clova developer ecosystem, and involving as many engineers as possible in Clova.

At the moment, Wave can play music; tell the weather, the time, and your fortune; act as a remote control; read and send Line messages; and chat with users. But with in-house teams, third-party developers, and partners, Line plans to expand Wave’s features and fine-tune the smart speaker.

“Hopefully, we can create many different applications and we will grow from there,” Hashimoto says.

Original Link