How Competition Affects Trust In The Workplace

Competition is widely believed to have a corrosive impact on trust, but what happens when that competition comes from a rival firm? Does the external "threat" bind us together? That was the conclusion reached by a recent study from the University of British Columbia, Princeton University and Aix-Marseille University.

The researchers collected data from the manufacturing sector in both the United States and Germany, and found that the more intense the competition within the sector, the more likely pro-social behaviors, such as cooperation and knowledge sharing, were within each company.

Original Link

The Effects Of Working With A Lazy Manager

It’s often said that we leave jobs in large part because of the bosses we worked under. There are many things about a bad manager that can prompt our wrath, but perhaps the foremost sin is to be workshy. New research from the University of Exeter highlights just how pernicious working under a lazy manager can be to our well-being and productivity.

The researchers wanted to examine the impact of having a lazy boss on the team working under them, both in terms of their mentality and productivity. The analysis revealed that leaders who regularly procrastinate, both in doing tasks and in making decisions, lower the commitment levels of their employees. More worryingly, the study also saw an increase in mendacious behaviors among staff.

Original Link

Leadership Lessons for Creating High-Performing Scrum Teams

This is the last in a series of 3 blogs presenting the result of an interesting research study from Sam Walker.
Walker discovered that the most successful sports teams that ever existed all shared one single element: they all had a team captain with 7 overlapping traits that made them extremely successful. In this blog, we will explore what Agile Leaders can learn from these extremely successful team captains.

6 Lessons to Learn from Elite Team Captains

The Scrum Master role has a lot of overlap with the team captains from Walker’s research.
Another overlap with Walker’s research is the role of the sports coach: the Agile Leader, responsible for the Scrum Teams.

Original Link

Key Insights from the 2018 State of the Enterprise Datacenter Report

The 2018 State of the Enterprise Datacenter report presents the responses of more than 2,000 IT professionals worldwide who weighed in with their thoughts on these questions and many more about hyperconvergence, datacenter operational maturity, public vs. private cloud, and more.

Let’s take a look at some of the highlights of the report and see what they mean for your datacenter transformation journey.

Original Link

Streaming RNNs in TensorFlow

The Machine Learning team at Mozilla Research continues to work on an automatic speech recognition engine as part of Project DeepSpeech, which aims to make speech technologies and trained models openly available to developers. We’re hard at work improving performance and ease-of-use for our open source speech-to-text engine. The upcoming 0.2 release will include a much-requested feature: the ability to do speech recognition live, as the audio is being recorded. This blog post describes how we changed the STT engine’s architecture to allow for this, achieving real-time transcription performance. Soon, you’ll be able to transcribe audio at least as fast as it’s coming in.

When applying neural networks to sequential data like audio or text, it’s important to capture patterns that emerge over time. Recurrent neural networks (RNNs) are neural networks that “remember” — they take as input not just the next element in the data, but also a state that evolves over time, and use this state to capture time-dependent patterns. Sometimes, you may want to capture patterns that depend on future data as well. One of the ways to solve this is by using two RNNs, one that goes forward in time and one that goes backward, starting from the last element in the data and going to the first element. You can learn more about RNNs (and about the specific type of RNN used in DeepSpeech) in this article by Chris Olah.

Using a bidirectional RNN

The current release of DeepSpeech (previously covered on Hacks) uses a bidirectional RNN implemented with TensorFlow, which means it needs to have the entire input available before it can begin to do any useful work. One way to improve this situation is by implementing a streaming model: Do the work in chunks, as the data is arriving, so when the end of the input is reached, the model is already working on it and can give you results more quickly. You could also try to look at partial results midway through the input.

This animation shows how the data flows through the network. Data flows from the audio input to feature computation, through three fully connected layers. Then it goes through a bidirectional RNN layer, and finally through a final fully connected layer, where a prediction is made for a single time step.

In order to do this, you need to have a model that lets you do the work in chunks. Here’s the diagram of the current model, showing how data flows through it.

As you can see, on the bidirectional RNN layer, the data for the very last step is required for the computation of the second-to-last step, which is required for the computation of the third-to-last step, and so on. These are the red arrows in the diagram that go from right to left.

We could implement partial streaming in this model by doing the computation up to layer three as the data is fed in. The problem with this approach is that it wouldn’t gain us much in terms of latency: Layers four and five are responsible for almost half of the computational cost of the model.

Using a unidirectional RNN for streaming

Instead, we can replace the bidirectional layer with a unidirectional layer, which does not have a dependency on future time steps. That lets us do the computation all the way to the final layer as soon as we have enough audio input.

With a unidirectional model, instead of feeding the entire input in at once and getting the entire output, you can feed the input piecewise. Meaning, you can input 100ms of audio at a time, get those outputs right away, and save the final state so you can use it as the initial state for the next 100ms of audio.

An alternative architecture that uses a unidirectional RNN in which each time step only depends on the input at that time and the state from the previous step.
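
To make the piecewise idea concrete, here is a minimal NumPy sketch. It is not the DeepSpeech model: it uses a toy tanh RNN with random weights (all names and sizes are illustrative assumptions) to show that feeding the input in windows while carrying the state forward gives the same outputs as feeding the whole input at once.

```python
import numpy as np

def rnn_forward(inputs, state, w_in, w_rec):
    """Run a toy tanh RNN over `inputs` (time, features), starting from `state`."""
    outputs = []
    for x_t in inputs:
        # Each step sees only the current input and the previous state.
        state = np.tanh(x_t @ w_in + state @ w_rec)
        outputs.append(state)
    return np.stack(outputs), state

# Illustrative sizes and random weights (assumptions, not the real model).
rng = np.random.default_rng(0)
n_steps, n_features, width = 48, 26, 64
w_in = rng.normal(scale=0.1, size=(n_features, width))
w_rec = rng.normal(scale=0.1, size=(width, width))
audio_features = rng.normal(size=(n_steps, n_features))

# Whole input at once.
full_outputs, _ = rnn_forward(audio_features, np.zeros(width), w_in, w_rec)

# Same input fed in windows of 16 steps, carrying the state across windows.
state = np.zeros(width)
chunks = []
for start in range(0, n_steps, 16):
    out, state = rnn_forward(audio_features[start:start + 16], state, w_in, w_rec)
    chunks.append(out)
streamed_outputs = np.concatenate(chunks)

outputs_match = np.allclose(full_outputs, streamed_outputs)
print(outputs_match)
```

A backward-running RNN cannot be split this way, because its first output depends on the last input, which is why the bidirectional layer has to go.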

Here’s code for creating an inference graph that can keep track of the state between each input window:

import tensorflow as tf

# ALPHABET_SIZE is not defined in this listing; it is assumed to hold the
# number of characters in the model's alphabet. The extra output unit
# (ALPHABET_SIZE + 1) is the CTC blank label.

def create_inference_graph(batch_size=1, n_steps=16, n_features=26, width=64):
    input_ph = tf.placeholder(dtype=tf.float32,
                              shape=[batch_size, n_steps, n_features],
                              name='input')
    sequence_lengths = tf.placeholder(dtype=tf.int32,
                                      shape=[batch_size],
                                      name='input_lengths')
    previous_state_c = tf.get_variable(dtype=tf.float32,
                                       shape=[batch_size, width],
                                       name='previous_state_c')
    previous_state_h = tf.get_variable(dtype=tf.float32,
                                       shape=[batch_size, width],
                                       name='previous_state_h')
    previous_state = tf.contrib.rnn.LSTMStateTuple(previous_state_c, previous_state_h)

    # Transpose from batch major to time major
    input_ = tf.transpose(input_ph, [1, 0, 2])

    # Flatten time and batch dimensions for feed forward layers
    input_ = tf.reshape(input_, [batch_size*n_steps, n_features])

    # Three ReLU hidden layers
    layer1 = tf.contrib.layers.fully_connected(input_, width)
    layer2 = tf.contrib.layers.fully_connected(layer1, width)
    layer3 = tf.contrib.layers.fully_connected(layer2, width)

    # Restore the time-major [n_steps, batch_size, width] shape that the
    # fused LSTM cell expects
    layer3 = tf.reshape(layer3, [n_steps, batch_size, width])

    # Unidirectional LSTM
    rnn_cell = tf.contrib.rnn.LSTMBlockFusedCell(width)
    rnn, new_state = rnn_cell(layer3, initial_state=previous_state)
    new_state_c, new_state_h = new_state

    # Final hidden layer
    layer5 = tf.contrib.layers.fully_connected(rnn, width)

    # Output layer
    output = tf.contrib.layers.fully_connected(layer5, ALPHABET_SIZE+1, activation_fn=None)

    # Automatically update previous state with new state
    state_update_ops = [
        tf.assign(previous_state_c, new_state_c),
        tf.assign(previous_state_h, new_state_h)
    ]
    with tf.control_dependencies(state_update_ops):
        logits = tf.identity(output, name='logits')

    # Create state initialization operations
    zero_state = tf.zeros([batch_size, width], tf.float32)
    initialize_c = tf.assign(previous_state_c, zero_state)
    initialize_h = tf.assign(previous_state_h, zero_state)
    initialize_state = tf.group(initialize_c, initialize_h, name='initialize_state')

    return {
        'inputs': {
            'input': input_ph,
            'input_lengths': sequence_lengths,
        },
        'outputs': {
            'output': logits,
            'initialize_state': initialize_state,
        }
    }

The graph created by the code above has two inputs and two outputs. The inputs are the sequences and their lengths. The outputs are the logits and a special “initialize_state” node that needs to be run at the beginning of a new sequence. When freezing the graph, make sure you don’t freeze the state variables previous_state_h and previous_state_c.

Here’s code for freezing the graph:

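from tensorflow.python.tools import freeze_graph

# session, saver, checkpoint_path and output_graph_path come from the
# surrounding export code, which is not shown here.
freeze_graph.freeze_graph_with_def_protos(
    input_graph_def=session.graph_def,
    input_saver_def=saver.as_saver_def(),
    input_checkpoint=checkpoint_path,
    output_node_names='logits,initialize_state',
    restore_op_name=None,
    filename_tensor_name=None,
    output_graph=output_graph_path,
    clear_devices=False,  # required argument in TF 1.x; False keeps device placements
    initializer_nodes='',
    # Keep the state variables as variables so they can be updated between windows
    variable_names_blacklist='previous_state_c,previous_state_h')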

With these changes to the model, we can use the following approach on the client side:

  1. Run the “initialize_state” node.
  2. Accumulate audio samples until there’s enough data to feed to the model (16 time steps in our case, or 320ms).
  3. Feed through the model, accumulate outputs somewhere.
  4. Repeat 2 and 3 until data is over.
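
The loop above can be sketched in a few lines of Python. This is only an illustration of the buffering logic, not the real client: `StubModel` and its methods are hypothetical stand-ins for the engine, and the 16 kHz sample rate is taken from the recording command used in the microphone example in this post.

```python
import numpy as np

SAMPLE_RATE = 16000                                # Hz (assumed, matches `-r 16k`)
WINDOW_MS = 320                                    # 16 time steps, per the post
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_MS // 1000   # 5120 samples per window

class StubModel:
    """Hypothetical stand-in for the engine; not the DeepSpeech API."""
    def initialize_state(self):
        self.windows_seen = 0

    def infer(self, window):
        self.windows_seen += 1
        return [float(np.mean(window))]            # placeholder for real logits

def stream(model, audio_chunks):
    model.initialize_state()                       # step 1
    buffer = np.empty(0, dtype=np.int16)
    outputs = []
    for chunk in audio_chunks:
        buffer = np.concatenate([buffer, chunk])   # step 2: accumulate samples
        while len(buffer) >= WINDOW_SAMPLES:
            window, buffer = buffer[:WINDOW_SAMPLES], buffer[WINDOW_SAMPLES:]
            outputs.extend(model.infer(window))    # step 3: feed and collect
    return outputs, buffer                         # buffer holds leftover samples

# One second of silence arriving in 10 ms chunks (step 4 is the outer loop).
one_second = [np.zeros(160, dtype=np.int16) for _ in range(100)]
outputs, leftover = stream(StubModel(), one_second)
print(len(outputs), len(leftover))
```

A real client also has to decide what to do with the leftover samples at the end of the stream, for example padding them to a full window; that choice is left out of this sketch.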

It wouldn’t make sense to drown readers with hundreds of lines of the client-side code here, but if you’re interested, it’s all MPL 2.0 licensed and available on GitHub. We actually have two different implementations, one in Python that we use for generating test reports, and one in C++ which is behind our official client API.

Performance improvements

What does this all mean for our STT engine? Well, here are some numbers, compared with our current stable release:

  • Model size down from 468MB to 180MB
  • Time to transcribe: 3s file on a laptop CPU, down from 9s to 1.5s
  • Peak heap usage down from 4GB to 20MB (model is now memory-mapped)
  • Total heap allocations down from 12GB to 264MB
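
One way to read the timing figure is as a real-time factor, meaning processing time divided by audio duration, where anything below 1.0 is faster than real time. The term and the variable names below are ours rather than the post's; the inputs are the numbers from the list above.

```python
audio_seconds = 3.0
old_seconds, new_seconds = 9.0, 1.5

old_rtf = old_seconds / audio_seconds    # 3.0: three times slower than real time
new_rtf = new_seconds / audio_seconds    # 0.5: twice as fast as real time
speedup = old_seconds / new_seconds      # 6.0

# Fraction of the model size removed (468MB down to 180MB), roughly 0.62.
model_size_reduction = 1 - 180 / 468

print(old_rtf, new_rtf, speedup, round(model_size_reduction, 2))
```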

Of particular importance to me is that we’re now faster than real time without using a GPU, which, together with streaming inference, opens up lots of new usage possibilities like live captioning of radio programs, Twitch streams, and keynote presentations; home automation; voice-based UIs; and so on. If you’re looking to integrate speech recognition in your next project, consider using our engine!

Here’s a small Python program that demonstrates how to use libSoX to record from the microphone and feed it into the engine as the audio is being recorded.

import argparse
import deepspeech as ds
import numpy as np
import shlex
import subprocess
import sys

parser = argparse.ArgumentParser(description='DeepSpeech speech-to-text from microphone')
parser.add_argument('--model', required=True, help='Path to the model (protocol buffer binary file)')
parser.add_argument('--alphabet', required=True, help='Path to the configuration file specifying the alphabet used by the network')
parser.add_argument('--lm', nargs='?', help='Path to the language model binary file')
parser.add_argument('--trie', nargs='?', help='Path to the language model trie file created with native_client/generate_trie')
args = parser.parse_args()

LM_WEIGHT = 1.50
BEAM_WIDTH = 512
# The three constants below are used later but are missing from this listing.
# The values are assumptions; adjust them to match your model.
VALID_WORD_COUNT_WEIGHT = 2.25
N_FEATURES = 26  # matches n_features in the inference graph above
N_CONTEXT = 9

print('Initializing model...')

model = ds.Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
if args.lm and args.trie:
    model.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_WEIGHT, VALID_WORD_COUNT_WEIGHT)
sctx = model.setupStream()

# Record raw 16-bit, 16 kHz mono audio from the microphone with SoX's rec
subproc = subprocess.Popen(shlex.split('rec -q -V0 -e signed -L -c 1 -b 16 -r 16k -t raw - gain -2'),
                           stdout=subprocess.PIPE,
                           bufsize=0)
print('You can start speaking now. Press Control-C to stop recording.')

try:
    while True:
        # Read a chunk of samples from rec's stdout and feed it to the stream
        # (the 512-byte chunk size is an assumption)
        data = subproc.stdout.read(512)
        model.feedAudioContent(sctx, np.frombuffer(data, np.int16))
except KeyboardInterrupt:
    print('Transcription:', model.finishStream(sctx))
    subproc.terminate()
    subproc.wait()

Finally, if you’re looking to contribute to Project DeepSpeech itself, we have plenty of opportunities. The codebase is written in Python and C++, and we would love to add iOS and Windows support, for example. Reach out to us via our IRC channel or our Discourse forum.

The post Streaming RNNs in TensorFlow appeared first on Mozilla Hacks – the Web developer blog.

Original Link

Briefing: Microsoft to set up Asia AI research branch in Shanghai

During the World AI Conference taking place in Shanghai, Microsoft announced they will launch Microsoft Research Asia’s Shanghai branch for AI.

Original Link

Overscripted! Digging into JavaScript execution at scale

This research was conducted in partnership with the UCOSP (Undergraduate Capstone Open Source Projects) initiative. UCOSP facilitates open source software development by connecting Canadian undergraduate students with industry mentors to practice distributed development and data projects.

The team consisted of the following Mozilla staff: Martin Lopatka, David Zeber, Sarah Bird, Luke Crouch, Jason Thomas

2017 student interns — crawler implementation and data collection: Ruizhi You, Louis Belleville, Calvin Luo, Zejun (Thomas) Yu

2018 student interns — exploratory data analysis projects: Vivian Jin, Tyler Rubenuik, Kyle Kung, Alex McCallum

As champions of a healthy Internet, we at Mozilla have been increasingly concerned about the current advertisement-centric web content ecosystem. Web-based ad technologies continue to evolve increasingly sophisticated programmatic models for targeting individuals based on their demographic characteristics and interests. The financial underpinnings of the current system incentivise optimizing on engagement above all else. This, in turn, has evolved an insatiable appetite for data among advertisers aggressively iterating on models to drive human clicks.

Most of the content, products, and services we use online, whether provided by media organisations or by technology companies, are funded in whole or in part by advertising and various forms of marketing.

–Timothy Libert and Rasmus Kleis Nielsen [link]

We’ve talked about the potentially adverse effects on the Web’s morphology and how content silos can impede a diversity of viewpoints. Now, the Mozilla Systems Research Group is raising a call to action. Help us search for patterns that describe, expose, and illuminate the complex interactions between people and pages!

Inspired by the Web Census recently published by Steven Englehardt and Arvind Narayanan of Princeton University, we adapted the OpenWPM crawler framework to perform a comparable crawl gathering a rich set of information about the JavaScript execution on various websites. This enables us to delve into further analysis of web tracking, as well as a general exploration of client-page interactions and a survey of different APIs employed on the modern Web.

In short, we set out to explore the unseen or otherwise not obvious series of JavaScript execution events that are triggered once a user visits a webpage, and all the first- and third-party events that are set in motion when people retrieve content. To help enable more exploration and analysis, we are releasing our full set of data about JavaScript executions as open source.

The following sections will introduce the data set, how it was collected and the decisions made along the way. We’ll share examples of insights we’ve discovered and we’ll provide information on how to participate in the associated “Overscripted Web: A Mozilla Data Analysis Challenge”, which we’ve launched today with Mozilla’s Open Innovation Team.

The Dataset

In October 2017, several Mozilla staff and a group of Canadian undergraduate students forked the OpenWPM crawler repository to begin tinkering, in order to collect a plethora of information about the unseen interactions between modern websites and the Firefox web browser.

Preparing the seed list

The master list of pages we crawled in preparing the dataset was itself generated from a preliminary shallow crawl we performed in November 2017. We ran a depth-1 crawl, seeded by Alexa’s top 10,000 site list, using 4 different machines at 4 different IP addresses (all in residential non-Amazon IP addresses served by Canadian internet service providers). The crawl was implemented using the Requests Python library and collected no information except for an indication of successful page loads.

Of the 2,150,251 pages represented in the union of the 4 parallel shallow crawls, we opted to use the intersection of the four lists in order to prune out dynamically generated (e.g. personalized) outbound links that varied between them. This meant a reduction to 981,545 URLs, which formed the seed list for our main OpenWPM crawl.

The Main Collection

The following workflow describes (at a high level) the collection of page information contained in this dataset.

  1. Alexa top 10k (10,000 high traffic pages as of November 1st, 2017)
  2. A precrawl using the Python Requests library visits each one of those pages
    1. The Requests library requests that page
    2. That page sends a response
    3. All href attributes in the response are captured to a depth of 1 (away from the Alexa page)
      1. For each of those links, all valid pages (starting with “http”) are added to the link set.
      2. The link set union (2,150,251) was examined using the Requests library in parallel, which gives us the intersection list of 981,545.
      3. The set of URLs in the intersection list (981,545) is passed to the deeper crawl for JavaScript analysis in a parallelized form.
  3. Each of these pages was sent to our adapted version of OpenWPM to have the execution of JavaScript recorded for 10 seconds.
  4. The window.location was hashed as the unique identifier of the location where the JavaScript was executed (to ensure unique source attribution).
    1. When OpenWPM hits content that is inside an iFrame, the location of the content is reported.
    2. Since we use the window.location to determine the location element of the content, each time an iFrame is encountered, that location can be split into the parent location of the page and the iFrame location.
    3. Data collection and aggregation performed through a websocket associates all the activity linked to a location hash for compilation of the crawl dataset.
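
The seed-list pruning and the location hashing in the workflow above can be sketched as follows. This is an illustration under stated assumptions, not the crawler's actual code: the function names and example URLs are made up, and SHA-256 is our choice because the post does not name the hash function.

```python
import hashlib

def build_seed_list(replicate_links):
    """replicate_links: one set of captured URLs per shallow-crawl replicate."""
    # Keep only valid pages, as in the workflow (links starting with "http").
    valid = [{u for u in links if u.startswith("http")} for links in replicate_links]
    union = set().union(*valid)
    # Links that differ between replicates (e.g. personalized ones) drop out here.
    intersection = set.intersection(*valid)
    return union, intersection

def location_id(window_location):
    """Hash a window.location string into a fixed-length identifier.

    SHA-256 is an assumption; the post does not say which hash was used."""
    return hashlib.sha256(window_location.encode("utf-8")).hexdigest()

# Four hypothetical replicates of the same shallow crawl.
replicates = [
    {"https://example.com/a", "https://example.com/b", "https://example.com/?session=1"},
    {"https://example.com/a", "https://example.com/b", "https://example.com/?session=2"},
    {"https://example.com/a", "https://example.com/b", "mailto:someone@example.com"},
    {"https://example.com/a", "https://example.com/b", "https://example.com/?session=4"},
]
union, seed_list = build_seed_list(replicates)
print(len(union), sorted(seed_list))
print(location_id("https://example.com/a"))
```

Taking the intersection trades coverage for stability: a page has to show up in every replicate to be crawled, which is what removes the dynamically generated links.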

Interestingly, for the Alexa top 10,000 sites, our depth-1 crawl yielded properties hosted on 41,166 TLDs across the union of our 4 replicates, whereas only 34,809 unique TLDs remain among the 981,545 pages belonging to their intersection.

A modified version of OpenWPM was used to record JavaScript calls from these pages that could potentially be used for browser tracking. The collected JavaScript execution trace was written into an S3 bucket for later aggregation and analysis. Several additional parameters were defined based on cursory ad hoc analyses.

For example, the minimum dwell time per page required to capture the majority of JavaScript activity was set as 10 seconds per page. This was based on a random sampling of the seed list URLs and showed a large variation in time until no new JavaScript was being executed (from no JavaScript, to what appeared to be an infinite loop of self-referential JavaScript calls). This dwell time was chosen to balance between capturing the majority of JavaScript activity on a majority of the pages and minimizing the time required to complete the full crawl.

Several of the probes instrumented in the Data Leak repo were ported over to our hybrid crawler, including instrumentation to monitor JavaScript execution occurring inside an iFrame element (potentially hosted on a third-party domain). This proved to provide much insight into the relationships between pages in the crawl data.

Exploratory work

In January 2018, we got to work analyzing the dataset we had created. After substantial data cleaning to work through the messiness of real world variation, we were left with a gigantic Parquet dataset (around 70GB) containing an immense diversity of potential insights. Three example analyses are summarized below. The most important finding is that we have only just scratched the surface of the insights this data may hold.

Examining session replay activity

Session replay is a service that lets websites track users’ interactions with the page—from how they navigate the site, to their searches, to the input they provide. Think of it as a “video replay” of a user’s entire session on a webpage. Since some session replay providers may record personal information such as personal addresses, credit card information and passwords, this can present a significant risk to both privacy and security.

We explored the incidence of session replay usage, and a few associated features, across the pages in our crawl dataset. To identify potential session replay, we obtained the Princeton WebTAP project list, containing 14 Alexa top-10,000 session replay providers, and checked for calls to script URLs belonging to the list.

Out of 6,064,923 distinct script references among page loads in our dataset, we found 95,570 (1.6%) were to session replay providers. This translated to 4,857 distinct domain names (netloc) making such calls, out of a total of 87,325, or 5.6%. Note that even if scripts belonging to session replay providers are being accessed, this does not necessarily mean that session replay functionality is being used on the site.
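
A check of this kind can be sketched with the standard library. The provider domains, record layout, and function names below are hypothetical; the real list comes from the Princeton WebTAP project, and how the analysis defined a distinct script reference is not spelled out in the post, so treat this as one plausible reading.

```python
from urllib.parse import urlparse

# Hypothetical provider domains, for illustration only.
REPLAY_PROVIDERS = {"replay-provider.example", "recorder.example"}

def is_replay_script(script_url, providers=REPLAY_PROVIDERS):
    host = urlparse(script_url).hostname or ""
    # Match the provider domain itself or any subdomain of it.
    return any(host == p or host.endswith("." + p) for p in providers)

def summarize(script_calls):
    """script_calls: iterable of (page_url, script_url) pairs."""
    calls = set(script_calls)
    replay_calls = {c for c in calls if is_replay_script(c[1])}
    all_domains = {urlparse(page).netloc for page, _ in calls}
    replay_domains = {urlparse(page).netloc for page, _ in replay_calls}
    return {
        "script_share": len(replay_calls) / len(calls),
        "domain_share": len(replay_domains) / len(all_domains),
    }

calls = [
    ("https://shop.example/cart", "https://cdn.replay-provider.example/r.js"),
    ("https://shop.example/cart", "https://shop.example/app.js"),
    ("https://news.example/", "https://news.example/main.js"),
    ("https://blog.example/post", "https://notreplay-provider.example/x.js"),
]
summary = summarize(calls)
print(summary)
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as the last entry, whose host merely ends with a provider's name.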

Given the set of pages making calls to session replay providers, we also looked into the consistency of SSL usage across these calls. Interestingly, the majority of such calls were made over HTTPS (75.7%), and 49.9% of the pages making these calls were accessed over HTTPS. Additionally, we found no pages accessed over HTTPS making calls to session replay scripts over HTTP, which was surprising but encouraging.

Finally, we examined the distribution of TLDs across sites making calls to session replay providers, and compared this to TLDs over the full dataset. We found that, along with .com, .ru accounted for a surprising proportion of sites accessing such scripts (around 33%), whereas .ru domain names made up only 3% of all pages crawled. This implies that 65.6% of .ru sites in our dataset were making calls to potential session replay provider scripts. However, this may be explained by the fact that Yandex is one of the primary session replay providers, and it offers a range of other analytics services of interest to Russian-language websites.

Eval and dynamically created function calls

JavaScript allows a function call to be dynamically created from a string with the eval() function or by creating a new Function() object. For example, this code will print hello twice:

eval("console.log('hello')")
var my_func = new Function("console.log('hello')")
my_func()

While dynamic function creation has its uses, it also opens up users to injection attacks, such as cross-site scripting, and can potentially be used to hide malicious code.

In order to understand how dynamic function creation is being used on the Web, we analyzed its prevalence, location, and distribution in our dataset. The analysis was initially performed on 10,000 randomly selected pages and validated against the entire dataset. In terms of prevalence, we found that 3.72% of overall function calls were created dynamically, and these originated from across 8.76% of the websites crawled in our dataset.

These results suggest that, while dynamic function creation is not used heavily, it is still common enough on the Web to be a potential concern. Looking at call frequency per page showed that, while some Web pages create all their function calls dynamically, the majority tend to have only 1 or 2 dynamically generated calls (which is generally 1-5% of all calls made by a page).

We also examined the extent of this practice among the scripts that are being called. We discovered that they belong to a relatively small subset of script hosts (at an average ratio of about 33 calls per URL), indicating that the same JavaScript files are being used by multiple webpages. Furthermore, around 40% of these are known trackers (identified using the disconnectme entity list), although only 33% are hosted on a different domain from the webpage that uses them. This suggests that web developers may not even know that they are using dynamically generated functions.


Cryptojacking

Cryptojacking refers to the unauthorized use of a user’s computer or mobile device to mine cryptocurrency. More and more websites are using browser-based cryptojacking scripts as cryptocurrencies rise in popularity. It is an easy way to generate revenue and a viable alternative to bloating a website with ads. An excellent contextualization of crypto-mining via client-side JavaScript execution can be found in the unabridged cryptojacking analysis prepared by Vivian Jin.

We investigated the prevalence of cryptojacking among the websites represented in our dataset. A list of potential cryptojacking hosts (212 sites total) was obtained from the adblock-nocoin-list GitHub repo. For each script call initiated on a page visit event, we checked whether the script host belonged to the list. Among 6,069,243 distinct script references on page loads in our dataset, only 945 (0.015%) were identified as cryptojacking hosts. Over half of these belonged to CoinHive, the original script developer. Only one use of AuthedMine was found. Viewed in terms of domains reached in the crawl, we found calls to cryptojacking scripts being made from 49 out of 29,483 distinct domains (0.16%).

However, it is important to note that cryptojacking code can be executed in other ways than by including the host script in a script tag. It can be disguised, stealthily executed in an iframe, or directly used in a function of a first-party script. Users may also face redirect loops that eventually lead to a page with a mining script. The low detection rate could also be due to the popularity of the sites covered by the crawl, which might dissuade site owners from implementing obvious cryptojacking scripts. It is likely that the actual rate of cryptojacking is higher.

The majority of the domains we found using cryptojacking are streaming sites. This is unsurprising, as users have streaming sites open for longer while they watch video content, and mining scripts can be executed longer. A single Chinese variety site accounted for 207 out of the overall 945 cryptojacking script calls we found in our analysis, by far the largest share of any domain we observed.

Another interesting fact: although our cryptojacking host list contained 212 candidates, we found only 11 of them to be active in our dataset, or about 5%.
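
For readers who want to check the proportions quoted in this section, the arithmetic is short. The counts are the ones given in the text; the helper and variable names are ours.

```python
def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

script_share = pct(945, 6_069_243)   # cryptojacking share of distinct script references
domain_share = pct(49, 29_483)       # share of distinct domains calling such scripts
active_hosts = pct(11, 212)          # share of listed hosts seen in the dataset
largest_site = pct(207, 945)         # share of calls coming from the single largest site

for name, value in [("script_share", script_share), ("domain_share", domain_share),
                    ("active_hosts", active_hosts), ("largest_site", largest_site)]:
    print(f"{name}: {value:.3f}%")
```

The single largest site accounts for roughly a fifth of all detected calls, which is worth keeping in mind when reading the overall rate.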

Limitations and future directions

While this is a rich dataset allowing for a number of interesting analyses, it is limited in visibility mainly to behaviours that occur via JS API calls.

Another feature we investigated using our dataset is the presence of Evercookies. Evercookies is a tracking tool used by websites to ensure that user data, such as a user ID, remains permanently stored on a computer. Evercookies persist in the browser by leveraging a series of tricks including Web API calls to a variety of available storage mechanisms. An initial attempt was made to search for evercookies in this data by searching for consistent values being passed to suspect Web API calls.

Acar et al., “The Web Never Forgets: Persistent Tracking Mechanisms in the Wild”, (2014) developed techniques for looking at evercookies at scale. First, they proposed a mechanism to detect identifiers. They applied this mechanism to HTTP cookies but noted that it could also be applied to other storage mechanisms, although some modification would be required. For example, they look at cookie expiration, which would not be applicable in the case of localStorage. For this dataset we could try replicating their methodology for set calls to window.document.cookie and window.localStorage.
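
As a starting point for the set calls mentioned above, one simple heuristic is to look for the same long value being written to more than one storage mechanism on a page. This is a deliberately simplified sketch and not the identifier-detection method of Acar et al.; the record layout, the length threshold, and the assumption that values have already been parsed down to bare strings are all ours.

```python
from collections import defaultdict

STORAGE_SYMBOLS = ("window.document.cookie", "window.localStorage")
MIN_ID_LENGTH = 8   # assumed threshold; short values are unlikely to be identifiers

def candidate_identifiers(set_calls):
    """set_calls: iterable of (page, symbol, value) records for 'set' operations.

    Returns {page: set of values written to more than one storage mechanism}."""
    seen = defaultdict(lambda: defaultdict(set))   # page -> value -> mechanisms
    for page, symbol, value in set_calls:
        mechanism = next((s for s in STORAGE_SYMBOLS if symbol.startswith(s)), None)
        if mechanism and len(value) >= MIN_ID_LENGTH:
            seen[page][value].add(mechanism)
    result = {}
    for page, values in seen.items():
        shared = {v for v, mechanisms in values.items() if len(mechanisms) > 1}
        if shared:
            result[page] = shared
    return result

# Hypothetical records, already reduced to bare values.
calls = [
    ("https://a.example/", "window.document.cookie", "8f3a9c1d2b7e"),
    ("https://a.example/", "window.localStorage.setItem", "8f3a9c1d2b7e"),
    ("https://a.example/", "window.localStorage.setItem", "dark"),
    ("https://b.example/", "window.document.cookie", "1122334455667788"),
]
candidates = candidate_identifiers(calls)
print(candidates)
```

A value that shows up in both places is only a candidate: confirming an evercookie would still require observing respawning across visits, which a single crawl cannot show.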

They also looked at Flash cookies respawning HTTP cookies and HTTP respawning Flash cookies. Our dataset contains no information on the presence of Flash cookies, so additional crawls would be required to obtain this information. In addition, they used multiple crawls to study Flash respawning, so we would have to replicate that procedure.

In addition to our lack of information on Flash cookies, we have no information about HTTP cookies, the first mechanism by which cookies are set. Knowing which HTTP cookies are initially set can serve as an important complement and validation for investigating other storage techniques then used for respawning and evercookies.

Beyond HTTP and Flash, Samy Kamkar’s evercookie library documents over a dozen mechanisms for storing an id to be used as an evercookie. Many of these are not detectable by our current dataset, e.g. HTTP Cookies, HSTS Pinning, Flash Cookies, Silverlight Storage, ETags, Web cache, Internet Explorer userData storage, etc. An evaluation of the prevalence of each technique would be a useful contribution to the literature. We also see the value of an ongoing repeated crawl to identify changes in prevalence and accounting for new techniques as they are discovered.

However, it is possible to continue analyzing the current dataset for some of the techniques described by Samy. For example, caching is listed as a technique. We can look at this property in our dataset, perhaps by applying the same ID technique outlined by Acar et al., or perhaps by looking at sequences of calls.


Throughout our preliminary exploration of this data it quickly became apparent that the amount of superficial JavaScript execution on a Web page only tells part of the story. We have observed several examples of scripts running parallel to the content-serving functionality of webpages, and these appear to fulfill a diversity of other functions. The analyses performed so far have led to some exciting discoveries, but so much more information remains hidden in the immense dataset available.

We are calling on any interested individuals to be part of the exploration. You’re invited to participate in the Overscripted Web: A Mozilla Data Analysis Challenge and help us better understand some of the hidden workings of the modern Web!


Extra special thanks to Steven Englehardt for his contributions to the OpenWPM tool and advice throughout this project. We also thank Havi Hoffman for valuable editorial contributions to earlier versions of this post. Finally, thanks to Karen Reid of University of Toronto for coordinating the UCOSP program.

More articles by Martin Lopatka…

Original Link

Insights from the Developer Community

Early in my career, I worked in product management and interacted with a team of highly skilled developers on a platform service. Although there were a few layers between me and the actual developers, I still had an opportunity to get to know them and become friends with them.

Some people have a misconception and tend to generalize that developers are a unique group, using words like introverts to describe them. Shows like Mr. Robot certainly reinforce this image.

However, my interactions and my perception are different, so I conducted my own research and surveyed the developer community within my network.

Below are some of the insights from the developer audience, along with summaries of research done by Accenture, other consulting firms, SlashData, and Stack Overflow’s annual developer survey of over 100,000 developers.

The People Behind the “Dev Team”

When you ask developers what their interests are, you get a sense of adventure from them. For example, developers like to explore, whether by kayaking, trekking, or hiking. These are just a few of the responses I collected. Other activities included hunting and going to the gym.

I also asked what types of movies, books, and music they enjoyed, and the answers covered a broad spectrum, from suspense to history.

Of course, foosball and gaming were a part of the responses from the developer audience.

Other Developer Audience Insights

In a survey conducted by Stack Overflow, I came across a number of observations and findings.

From a skillset perspective, DevOps and ML (machine learning) are important skills, in demand, and well paying. The convergence of disciplines is what will set developers apart in the future. We are already discussing how AI (artificial intelligence) will impact every job out there including coding, so someone that has multiple skills will stand out. Understanding the software and coding, as well as the device and hardware, along with the processes gives a unique perspective that makes a developer more efficient and effective.

Of course, machine learning and AI are highly desirable and rare skills right now and the large technology companies aren’t the only ones building large teams of data scientists and high-end developers.

At the coding level, the survey found that interest in Python went up and surpassed C# and PHP. ML is a factor in that rise in popularity. Again, understanding multiple languages, including lower-level code that can interact with devices and hardware, is a trend that continues to grow. Most developers spend their own time learning additional programming languages, since this is a fast-paced environment that is becoming more competitive.

In terms of area of focus, most developers identified themselves as back-end developers. Half also identified as full-stack. What was interesting is that, despite the surge in mobile usage, barely 1 in 5 developers identifies as a mobile developer. Could this be an opportunity to specialize and stand out?

Something that shouldn’t be a surprise is the use of Git for version control. Most developers have adopted this as a standard. For dev environments, the leader by far is Visual Studio.

Most developers use one of four platforms. While large companies have embraced one of the big 3 public cloud providers, internal apps tend to still be self-hosted. The four platforms include internal systems, Microsoft Azure, AWS, and Google Cloud Platform [source: Slash Data]. The type of architecture is changing too. 20% of developers are using serverless and 20% are using virtual machines.

Developer Perspective on Future

Developers also have visibility into the future of technology, since they are tinkering with and building what the rest of the world will see tomorrow. This includes the impact of AI and driverless cars, as well as AR and VR (augmented reality and virtual reality).

What’s interesting to tease out from this is the impact of ethics and AI. Whether it is driverless cars or other uses of AI, these are very powerful capabilities. Questions about who will decide the parameters, and whether it will be regulated are still unanswered.

Who decides when a driverless car is facing an impending accident and whether it will choose to injure party A or party B—what is the logic that will calculate that decision? In less drastic scenarios, will it be left to the developer and their morals and ethics to set parameters for how a machine makes decisions?

This is just one example of how the role of the developer is becoming central, not only within technology-centric organizations but within society as a whole.

Original Link

OPPO launches new research institute to boost capability in 5G, AI, and image processing

OPPO launches new research institute to boost capability in 5G, AI, and image processing · TechNode

Original Link

A Journey to <10% Word Error Rate

At Mozilla, we believe speech interfaces will be a big part of how people interact with their devices in the future. Today we are excited to announce the initial release of our open source speech recognition model so that anyone can develop compelling speech experiences.

The Machine Learning team at Mozilla Research has been working on an open source Automatic Speech Recognition engine modeled after the Deep Speech papers (1, 2) published by Baidu. One of the major goals from the beginning was to achieve a Word Error Rate in the transcriptions of under 10%. We have made great progress: Our word error rate on LibriSpeech’s test-clean set is 6.5%, which not only achieves our initial goal, but gets us close to human level performance.
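For readers new to the metric, here is a minimal sketch of how a word error rate is typically computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function name is ours and this is not the project's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
wer = word_error_rate("the cat has tiny paws", "the cat has tiny pause")
```

A target of under 10% therefore means fewer than one word-level mistake (substitution, insertion, or deletion) per ten reference words, averaged over the test set.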

This post is an overview of the team’s efforts and ends with a more detailed explanation of the final piece of the puzzle: the CTC decoder.

The architecture

Deep Speech is an end-to-end trainable, character-level, deep recurrent neural network (RNN). In less buzzwordy terms: it’s a deep neural network with recurrent layers that gets audio features as input and outputs characters directly — the transcription of the audio. It can be trained using supervised learning from scratch, without any external “sources of intelligence”, like a grapheme to phoneme converter or forced alignment on the input.

This animation shows how the data flows through the network. Data flows from the audio input to the feature computation, through three initial feed forward layers, then through a bidirectional RNN layer, and finally through the final softmax layer, where a character is predicted.

In practice, instead of processing slices of the audio input individually, we process all slices at once.

The network has five layers: the input is fed into three fully connected layers, followed by a bidirectional RNN layer, and finally a fully connected layer. The hidden fully connected layers use the ReLU activation. The RNN layer uses LSTM cells with tanh activation.

The output of the network is a matrix of character probabilities over time. In other words, for each time step the network outputs one probability for each character in the alphabet, which represents the likelihood of that character corresponding to what’s being said in the audio at that time. The CTC loss function (PDF link) considers all alignments of the audio to the transcription at the same time, allowing us to maximize the probability of the correct transcription being predicted without worrying about alignment. Finally, we train using the Adam optimizer.
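To make "considers all alignments at the same time" concrete, here is a small sketch of the standard CTC forward recursion in plain Python. It sums the probability of every frame-level alignment that collapses to a given label sequence. The blank index of 0, the function name, and the toy probabilities are assumptions for illustration; a real training loop would use the framework's built-in loss and work in log space.

```python
def ctc_sequence_probability(probs, labels, blank=0):
    """Total probability of all alignments of `labels` to the time steps.

    probs[t][k] is the network's probability of symbol k at time step t.
    """
    # Interleave blanks around the labels: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    n = len(ext)

    # alpha[s] is the probability of all partial alignments ending in ext[s].
    alpha = [0.0] * n
    alpha[0] = probs[0][blank]
    if n > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * n
        for s in range(n):
            total = alpha[s]              # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]     # advance by one position
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if n > 1 else 0.0)


# Two time steps, two symbols: index 0 is the blank, index 1 is "a".
probs = [[0.4, 0.6], [0.3, 0.7]]
p_a = ctc_sequence_probability(probs, [1])     # alignments "a_", "_a", "aa"
p_empty = ctc_sequence_probability(probs, [])  # alignment "__"
```

Training maximizes this quantity for the correct transcription, which is why no frame-level alignment between audio and text has to be supplied.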

The data

Supervised learning requires data, lots and lots of it. Training a model like Deep Speech requires thousands of hours of labeled audio, and obtaining and preparing this data can be as much work, if not more, as implementing the network and the training logic.

We started by downloading freely available speech corpora like TED-LIUM and LibriSpeech, as well as acquiring paid corpora like Fisher and Switchboard. We wrote importers in Python for the different data sets that convert the audio files to WAV, split the audio, and clean up the transcriptions by removing unneeded characters like punctuation and accents. Finally, we stored the preprocessed data in CSV files that can be used to feed data into the network.
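As a rough illustration of the transcript clean-up step, the sketch below strips accents and punctuation and writes the result to a CSV manifest. The column names, the set of characters kept, and the file path are assumptions made for this example, not the project's actual importer format; audio conversion and splitting are left out.

```python
import csv
import io
import unicodedata


def clean_transcript(text: str) -> str:
    """Lowercase, strip accents, and keep only a-z, apostrophes, and spaces."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes as plain ones
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    kept = [
        ch if (ch.isascii() and ch.isalpha()) or ch == "'" else " "
        for ch in without_accents.lower()
    ]
    # Collapse the runs of spaces left behind by removed characters.
    return " ".join("".join(kept).split())


def write_manifest(rows, out):
    """Write (wav_path, raw_transcript) pairs as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["wav_filename", "transcript"])  # hypothetical column names
    for wav_path, raw_text in rows:
        writer.writerow([wav_path, clean_transcript(raw_text)])


buffer = io.StringIO()
write_manifest([("clips/0001.wav", "Café, déjà vu!")], buffer)
manifest_lines = buffer.getvalue().splitlines()
```

Normalizing the text this way keeps the output alphabet small, which matters because the network predicts one character at a time.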

Using existing speech corpora allowed us to quickly start working on the model. But in order to achieve excellent results, we needed a lot more data. We had to be creative. We thought that maybe this type of speech data would already exist out there, sitting in people’s archives, so we reached out to public TV and radio stations, language study departments in universities, and basically anyone who might have labeled speech data to share. Through this effort, we were able to more than double the amount of training data we had to work with, which is now enough for training a high-quality English model.

Having a high-quality voice corpus publicly available not only helps advance our own speech recognition engine. It will eventually allow for broad innovation, because developers, startups, and researchers around the world can train and experiment with different architectures and models for different languages. It could help democratize access to Deep Learning for those who can’t afford to pay for thousands of hours of training data (almost everyone).

To build a speech corpus that’s free, open source, and big enough to create meaningful products with, we worked with Mozilla’s Open Innovation team and launched the Common Voice project to collect and validate speech contributions from volunteers all over the world. Today, the team is releasing a large collection of voice data into the public domain. Find out more about the release on the Open Innovation Medium blog.

The hardware

Deep Speech has over 120 million parameters, and training a model this large is a very computationally expensive task: you need lots of GPUs if you don’t want to wait forever for results. We looked into training on the cloud, but it doesn’t work financially: dedicated hardware pays for itself quite quickly if you do a lot of training. The cloud is a good way to do fast hyperparameter explorations though, so keep that in mind.

We started with a single machine running four Titan X Pascal GPUs, and then bought another two servers with 8 Titan XPs each. We run the two 8 GPU machines as a cluster, and the older 4 GPU machine is left independent to run smaller experiments and test code changes that require more compute power than our development machines have. This setup is fairly efficient, and for our larger training runs we can go from zero to a good model in about a week.

Setting up distributed training with TensorFlow was an arduous process. Although it has the most mature distributed training tools of the available deep learning frameworks, getting things to actually work without bugs and to take full advantage of the extra compute power is tricky. Our current setup works thanks to the incredible efforts of my colleague Tilman Kamp, who endured long battles with TensorFlow, Slurm, and even the Linux kernel until we had everything working.

Putting it all together

At this point, we have two papers to guide us, a model implemented based on those papers, the training data, and the hardware required for the training process. It turns out that replicating the results of a paper isn’t that straightforward. The vast majority of papers don’t specify all the hyperparameters they use, if they specify any at all. This means you have to spend a whole lot of time and energy doing hyperparameter searches to find a good set of values. Our initial results, obtained with values chosen through a mix of randomness and intuition, weren’t even close to those reported by the paper, probably due to small differences in the architecture: for one, we used LSTM (long short-term memory) cells instead of GRU (gated recurrent unit) cells. We spent a lot of time doing a binary search on dropout ratios, reduced the learning rate, changed the way the weights were initialized, and experimented with the size of the hidden layers as well. All of those changes got us pretty close to our desired target of <10% Word Error Rate, but not quite there.

One piece missing from our code was an important optimization: integrating our language model into the decoder. The CTC (Connectionist Temporal Classification) decoder works by taking the probability matrix that is output by the model and walking over it looking for the most likely text sequence according to the probability matrix. If at time step 0 the letter “C” is the most likely, and at time step 1 the letter “A” is the most likely, and at time step 2 the letter “T” is the most likely, then the transcription given by the simplest possible decoder will be “CAT”. This strategy is called greedy decoding.
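Here is a minimal sketch of greedy decoding over a toy probability matrix. Besides picking the most likely symbol per time step, it applies the usual CTC post-processing of merging repeated symbols and dropping blanks, which the description above leaves implicit. The four-symbol alphabet and the numbers are made up for illustration.

```python
ALPHABET = ["_", "A", "C", "T"]  # "_" stands for the CTC blank symbol


def greedy_decode(prob_matrix, alphabet=ALPHABET, blank="_"):
    """Pick the most likely symbol per time step, then collapse the result."""
    best = [
        alphabet[max(range(len(row)), key=row.__getitem__)]
        for row in prob_matrix
    ]
    out = []
    prev = None
    for ch in best:
        # Repeated symbols are merged and blanks are dropped.
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)


probs = [
    [0.1, 0.1, 0.7, 0.1],  # "C"
    [0.1, 0.1, 0.7, 0.1],  # "C" again, merged with the previous step
    [0.1, 0.7, 0.1, 0.1],  # "A"
    [0.7, 0.1, 0.1, 0.1],  # blank
    [0.1, 0.1, 0.1, 0.7],  # "T"
]
transcription = greedy_decode(probs)
```

Each row is looked at in isolation, so nothing outside the audio-derived probabilities can influence the result.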

A cat with icons of pause buttons where its paws would be.

This is a pretty good way of decoding the probabilities output by the model into a sequence of characters, but it has one major flaw: it only takes into account the output of the network, which means it only takes into account the information from audio. When the same audio has two equally likely transcriptions (think “new” vs “knew”, “pause” vs “paws”), the model can only guess at which one is correct. This is far from optimal: if the first four words in a sentence are “the cat has tiny”, we can be pretty sure that the fifth word will be “paws” rather than “pause”. Answering those types of questions is the job of a language model, and if we could integrate a language model into the decoding phase of our model, we could get way better results.

When we first tried to tackle this issue, we ran into a couple of blockers in TensorFlow: first, it doesn’t expose its beam scoring functionality in the Python API (probably for performance reasons); and second, the log probabilities output by the CTC loss function were (are?) invalid.

We decided to work around the problem by building something like a spell checker instead: go through the transcription and see if there are any small modifications we can make that increase the likelihood of that transcription being valid English, according to the language model. This did a pretty good job of correcting small spelling mistakes in the output, but as we got closer and closer to our target error rate, we realized that it wasn’t going to be enough. We’d have to bite the bullet and write some C++.
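The idea of that workaround can be sketched in a few lines, in the style of a classic edit-distance spell checker. This is not the code we used: a toy word-frequency table stands in for the language model, only single-character edits are considered, and each word is corrected independently of its neighbours.

```python
import string

# Toy stand-in for a language model: higher count means more likely (made-up numbers).
WORD_COUNTS = {"the": 50, "cat": 10, "has": 20, "tiny": 5, "paws": 3}


def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word):
    """Keep known words; otherwise pick the most likely known word one edit away."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word


def correct_transcription(text):
    return " ".join(correct_word(w) for w in text.split())


fixed = correct_transcription("teh cat haz tiny pawz")
```

The limitation is visible in the structure: the correction runs after decoding, so it can only patch the single transcription the decoder already committed to.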

Beam scoring with a language model

Integrating the language model into the decoder involves querying the language model every time we evaluate an addition to the transcription. Going back to the previous example, when deciding whether to choose “paws” or “pause” as the next word after “the cat has tiny”, we query the language model and use that score as a weight to sort the candidate transcriptions. Now we get to use information not just from the audio but also from our language model to decide which transcription is more likely. The algorithm is described in this paper by Hannun et al.
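A minimal sketch of the scoring idea, assuming the commonly used form of the combined score: acoustic log probability, plus a weighted language model log probability, plus a word-count bonus. The toy bigram table, the weights `alpha` and `beta`, and the function names are all hypothetical. The sketch also only ranks two finished candidates, whereas the real decoder applies the score incrementally inside a beam search.

```python
import math

# Toy bigram probabilities standing in for a real language model (made-up numbers).
BIGRAM_PROB = {("tiny", "paws"): 0.2, ("tiny", "pause"): 0.001}


def lm_log_prob(prev_word, word, floor=1e-6):
    """Log probability of `word` following `prev_word`; unseen pairs get a floor."""
    return math.log(BIGRAM_PROB.get((prev_word, word), floor))


def combined_score(acoustic_log_prob, words, alpha=1.0, beta=0.5):
    """Acoustic score + alpha * language model score + beta * word count."""
    lm = sum(lm_log_prob(a, b) for a, b in zip(words, words[1:]))
    return acoustic_log_prob + alpha * lm + beta * len(words)


# Both candidates sound the same, so the acoustic model scores them equally.
candidates = {
    "the cat has tiny paws": math.log(0.5),
    "the cat has tiny pause": math.log(0.5),
}
best = max(candidates, key=lambda text: combined_score(candidates[text], text.split()))
```

With equal acoustic scores, the language model term alone decides the ranking, which is exactly the case greedy decoding cannot resolve.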

Luckily, TensorFlow does have an extension point on its CTC beam search decoder that allows the user to supply their own beam scorer. This means all you have to do is write the beam scorer that queries the language model and plug that in. For our case, we wanted that functionality to be exposed to our Python code, so we also exposed it as a custom TensorFlow operation that can be loaded using tf.load_op_library.

Getting all of this to work with our setup required quite a bit of effort, from fighting with the Bazel build system for hours, to making sure all the code was able to handle Unicode input in a consistent way, and debugging the beam scorer itself. The system requires quite a few pieces to work together:

  • The language model itself (we use KenLM for building and querying).
  • A trie of all the words in our vocabulary.
  • An alphabet file that maps integer labels output by the network into characters.
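To show what the last two pieces do, here is a simplified sketch: a trie that answers "can this prefix still become a vocabulary word?" and an alphabet loader that maps integer labels to characters. The class and function names and the one-character-per-line file layout are assumptions for illustration, not the project's actual formats.

```python
class Trie:
    """Prefix tree over vocabulary words."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, text):
        """Return the node reached by following `text`, or None if it falls off."""
        node = self
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None

    def has_word(self, word):
        node = self._walk(word)
        return node is not None and node.is_word


def load_alphabet(lines):
    """Map integer labels to characters, one character per line (assumed layout)."""
    return {index: line.rstrip("\n") for index, line in enumerate(lines)}


vocabulary = Trie()
for word in ("paws", "pause", "cat"):
    vocabulary.insert(word)

alphabet = load_alphabet(["a\n", "c\n", "p\n", "s\n", "t\n", "w\n"])
decoded = "".join(alphabet[label] for label in [2, 0, 5, 3])
```

During decoding, a prefix check like `vocabulary.has_prefix("pa")` lets the scorer discard character sequences that can never turn into a known word before the language model is even queried.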

Although adding this many moving parts does make our code harder to modify and apply to different use cases (like other languages), it brings great benefits: Our word error rate on LibriSpeech’s test-clean set went from 16% to 6.5%, which not only achieves our initial goal, but gets us close to human level performance (5.83% according to the Deep Speech 2 paper). On a MacBook Pro, using the GPU, the model can do inference at a real-time factor of around 0.3x, and around 1.4x on the CPU alone. (A real-time factor of 1x means you can transcribe 1 second of audio in 1 second.)
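The arithmetic behind those figures is simple enough to spell out. The sketch below uses the definition given above (processing time divided by audio duration) and the error rates quoted in this post; the clip length and timings are hypothetical values chosen to match the reported factors.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# A hypothetical 10-second clip: 3 seconds on the GPU, 14 seconds on the CPU.
gpu_rtf = real_time_factor(3.0, 10.0)
cpu_rtf = real_time_factor(14.0, 10.0)

# Relative reduction in word error rate when going from 16% to 6.5%.
relative_wer_reduction = (16.0 - 6.5) / 16.0
```

In other words, the GPU keeps up with live speech with room to spare, the CPU alone falls behind, and the language model integration removed a bit under 60% of the remaining word errors.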

It has been an incredible journey to get to this place: the initial release of our model! In the future we want to release a model that’s fast enough to run on a mobile device or a Raspberry Pi.

If this type of work sounds interesting or useful to you, come check out our repository on GitHub and our Discourse channel. We have a growing community of contributors and we’re excited to help you create and publish a model for your language.

Reuben is an engineer in the Machine Learning group at Mozilla Research.

More articles by Reuben Morais…

Original Link